Abstract. Communities of highly connected actors form an essential feature in the
structure of several empirical directed and undirected networks. However, compared to
the amount of research on clustering for undirected graphs, there is relatively little
understanding of clustering in directed networks.

This paper extends the spectral clustering algorithm to directed networks in a way that 
co-clusters or bi-clusters the rows and columns of a graph Laplacian. Co-clustering lever- 
ages the increased complexity of asymmetric relationships to gain new insight into the 
structure of the directed network. To understand this algorithm and to study its asymptotic 
properties in a canonical setting, we propose the Stochastic Co-Blockmodel to encode 
co-clustering structure. This is the first statistical model of co-clustering and it is de- 
rived using the concept of stochastic equivalence that motivated the original Stochastic 
Blockmodel. Although directed spectral clustering is not derived from the Stochastic
Co-Blockmodel, we show that the algorithm can estimate the blocks in a high
dimensional asymptotic setting in which the number of blocks grows with the number
of nodes. The algorithm, model, and asymptotic results can all be extended to bipartite
graphs.



1. Introduction 

A diverse set of data sources that are characterized by interacting units or actors can 
be easily represented as a network or graph. Social networks, representing people who 
communicate with each other, are one example. There are two basic types of networks, di- 
rected and undirected. The relationships in directed graphs are asymmetric. For example, 
in a communication network, one person calls the other person. In an undirected graph, all 
the relationships are symmetric. Although the network literature has typically addressed 
undirected networks, assuming symmetric relationships is often a simplification. Sev- 
eral types of relationships are asymmetric. For example, directed relationships compose 
communication networks, citation networks, web graphs, and internet networks. Even
networks that are often represented as undirected networks (e.g. Facebook friendships and
road networks) are simplifications of an underlying directed network. In Facebook, each 
friendship is proposed by one of the friends and received by the other friend. This induces 
an asymmetry. Facebook users also interact asymmetrically (e.g. "posting on walls," "lik- 
ing posts," and "sharing posts"). In road networks, one is often interested in the flow of 
traffic; anyone who has a reverse commute can confirm that traffic flows asymmetrically. 
In biochemical cellular networks, a relationship represents the flow of information and/or 
energy in the cell. These are causal graphs, and causality without direction is merely cor- 
relation. In a wide range of applications, directed networks more accurately represent the 
collected data and possibly the corresponding data generating mechanism. 

Just as it is common to remove edge direction and study the resulting undirected net- 
work, the extant literature on clustering algorithms has focused almost exclusively on undi- 
rected networks and presumed symmetric extensions for directed networks. One way to 
symmetrize a graph is to add an opposing edge to each observed edge. So, if the data
contains an edge from a to b, then impute an edge from b to a. Satuluri and Parthasarathy
[2011] propose three additional ways of symmetrizing directed networks for the purpose of
clustering. However, there is additional information contained in the edge direction. This 
paper proposes co-clustering the nodes with a spectral method. Co-clustering leverages the 
increased complexity of directed relationships to allow for a richer set of possible studies 
on the structure of directed graphs. 

Co-clustering (a.k.a. bi-clustering) was first proposed in Hartigan [1972] for Euclidean
data arranged in a matrix $M \in \mathbb{R}^{n \times d}$. Where standard clustering techniques cluster the
rows of M into $k_r$ clusters, co-clustering simultaneously clusters the rows of M into $k_r$
clusters and clusters the columns of M into $k_c$ clusters. In the past decade, co-clustering
has become an important data analytic technique in biological applications (e.g.
Madeira and Oliveira [2004], Tanay et al. [2004], Tanay et al. [2005], Madeira et al.
[2010]), text processing (e.g. Dhillon [2001], Bisson and Hussain [2008]), and natural
language processing (e.g. Freitag [2004], Rohwer and Freitag [2004]). Banerjee et al.
[2004] suggest three advantages of co-clustering over traditional clustering. First, if the
d columns are clustered into $k_c \ll d$ clusters, then it is easier to interpret why the row
clusters are different. This is because the central point of each row cluster can now be
represented in $\mathbb{R}^{k_c}$ instead of the much larger $\mathbb{R}^{d}$. Second, clustering both the rows and the
columns dramatically reduces the number of parameters, implicitly providing statistical
regularization. To see this, notice that most clustering algorithms imply an approximate
matrix decomposition $M \approx L\mu$, where each row of M is replaced by the central point
of the cluster to which it belongs. In $L\mu$, $L \in \{0,1\}^{n \times k_r}$, $L_{ij} = 1$ if row i is in cluster
j, $\mu \in \mathbb{R}^{k_r \times d}$, and the jth row of $\mu$ contains the central point of cluster j. Co-clustering
further decomposes $\mu = \tilde{\mu} R^T$, where $\tilde{\mu} \in \mathbb{R}^{k_r \times k_c}$, $R \in \{0,1\}^{d \times k_c}$, and $R_{uv} = 1$ if
column u (of M) is in column-cluster v. Thus, $M \approx L \tilde{\mu} R^T$. By decomposing $\mu$ into $\tilde{\mu} R^T$,
co-clustering provides statistical regularization by reducing the number of parameters.
Finally, because there are fewer parameters to optimize, co-clustering algorithms have the
potential to be faster than traditional clustering algorithms. For example, the
computational complexity of a standard k-means step is $n d k_r$ and the computational complexity of
one iteration of an analogous co-clustering algorithm is $O(n k_c k_r + d k_c k_r)$ [Banerjee et al.,
2004].
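The decomposition $M \approx L \tilde{\mu} R^T$ described above can be illustrated with a minimal sketch. It assumes the row and column cluster labels are already known and that every cluster is nonempty; the function name `cocluster_approx` and the toy matrix are hypothetical and are not taken from Banerjee et al. [2004].

```python
import numpy as np

def cocluster_approx(M, row_labels, col_labels, k_r, k_c):
    """Return (L, mu_tilde, R) such that M is approximated by L @ mu_tilde @ R.T.

    Assumes the labels are given and that every cluster is nonempty.
    """
    n, d = M.shape
    L = np.zeros((n, k_r))
    L[np.arange(n), row_labels] = 1.0   # L[i, j] = 1 if row i is in row-cluster j
    R = np.zeros((d, k_c))
    R[np.arange(d), col_labels] = 1.0   # R[u, v] = 1 if column u is in column-cluster v
    # Mean of M over each (row-cluster, column-cluster) block.
    block_sums = L.T @ M @ R
    block_sizes = np.outer(L.sum(axis=0), R.sum(axis=0))
    mu_tilde = block_sums / block_sizes
    return L, mu_tilde, R

# Toy matrix with exact block structure (illustrative data only).
M = np.array([[1., 1., 5., 5.],
              [1., 1., 5., 5.],
              [9., 9., 2., 2.]])
row_labels = np.array([0, 0, 1])
col_labels = np.array([0, 0, 1, 1])

L, mu_tilde, R = cocluster_approx(M, row_labels, col_labels, k_r=2, k_c=2)
M_hat = L @ mu_tilde @ R.T

n_params_rows_only = 2 * M.shape[1]  # k_r * d entries in mu
n_params_cocluster = mu_tilde.size   # k_r * k_c entries in mu_tilde
```

In this toy case the row-only decomposition stores $k_r d = 8$ central-point entries, while the co-clustered version stores $k_r k_c = 4$, which is the parameter reduction the paragraph describes.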



The previous applications of co-clustering have involved matrices where the rows and 
columns index different sets of objects. For example, in text processing, the rows
correspond to documents, and the columns correspond to words. Element i, j of this matrix
denotes how many times word j appears in document i. This paper applies co-clustering to
an asymmetric square matrix where the rows and columns index the same set of nodes. As
such, each node i is in two types of clusters, one corresponding to row i and the other
corresponding to column i. The ith row of this matrix identifies the outgoing edges for node
i. The ith column of this matrix identifies the incoming edges for node i. So, two rows
are in the same co-cluster if they send edges to several of the same nodes; two columns 
are in the same co-cluster if they receive edges from several of the same nodes. In this 
way, co-clustering with directed graphs can lead to different types of interpretations and 
insights than previous applications of co-clustering (where the rows and columns index 
different sets). 

This paper proposes and studies a spectral co-clustering algorithm called DI-SIM. The
name DI-SIM has three meanings. First, because DI-SIM co-clusters the nodes, it uses two
distinct (but related) similarity measures between nodes: "the number of common parents"
and "the number of common offspring." In this sense, DI-SIM means two similarities.
Second, DI- also denotes that this algorithm is specifically for directed graphs. In fact,
co-clustering on symmetric graphs produces two sets of redundant clusters. This is because
row i is equal to column i. Additionally, DI-SIM, pronounced "dice 'em," dices data into
clusters.

The rest of this paper is organized as follows. Section 2 defines DI-SIM and briefly
explains how it differs from related algorithms. To study the performance of DI-SIM and
conceptualize why it is reasonable to co-cluster the nodes in a directed graph, Section
3 discusses the notion of "stochastic equivalence" in directed graphs and proposes the
Stochastic Co-Blockmodel to encode co-clustering structure. This model and the notion of
"stochastic equivalence" allow insight into the graph asymmetries and the type of structure
that co-clustering could estimate. Section 4 studies the asymptotic properties of DI-SIM
on the Stochastic Co-Blockmodel. Theorem 4.1 shows that under certain conditions,
DI-SIM can correctly co-cluster (i.e. estimate the co-blockmodel membership of) most nodes,
even in the high dimensional setting where the number of blocks grows with the number
of nodes. Section 5 presents two simulations investigating the two conditions in Theorem
4.1. Section 6 concludes the paper.

2. DI-SIM: A co-clustering algorithm for directed graphs

The next subsection gives a brief overview of the classical spectral algorithm for
undirected graphs. This motivates DI-SIM. A more detailed account of spectral clustering with
undirected graphs can be found in von Luxburg [2007].



2.1. Spectral clustering for undirected graphs. Spectral clustering is a popular and
computationally feasible algorithm for clustering the nodes of an undirected graph. It
has a rich history in the algebraic connectivity of graphs. Since the seminal work of
Donath and Hoffman [1973] and Fiedler [1973], the algorithm has been rediscovered and
reapplied in numerous different fields. The algorithm is popular because it provides a
computationally tractable approximation of the Cheeger constant, which is NP-hard to
compute exactly [Chung, 1997, Shi and Malik, 2000, von Luxburg, 2007]. Because of this,
computer scientists have found many applications for variations of spectral clustering, such as
load balancing and parallel computations [Van Driessche and Roose, 1995, Hendrickson
and Leland, 1995], partitioning circuits for very-large-scale integration design [Hagen and
Kahng, 1992] and sparse matrix partitioning [Pothen et al., 1990]. Spielman and Teng
[2007] and von Luxburg et al. [2008] give detailed histories of the algorithm.



Shi and Malik [2000] popularized the use of spectral clustering for image segmentation.
Since then, the algorithm has received significant attention in various machine learning 
applications. In the machine learning literature, the problem setup for spectral clustering is 
slightly different than the network approach in this paper. That literature presumes the data 
points lie in a metric space and a graph is constructed based on some measure of similarity 
between the points in this metric space. Several researchers have assumed that the data 
points lie on (or close to) a manifold and studied the asymptotic properties of the adjacency 
Others have studied the adjacency matrix, the graph Laplacian, and spectral clustering
under different conditions [Koltchinskii and Giné, 2000, von Luxburg et al., 2008,
Shi et al., 2009, Rohe et al., 2011].

To define the spectral clustering algorithm, let $G = (V, E)$ denote a graph, where V is
a vertex set and E is an edge set. The vertex set $V = \{v_1, \dots, v_n\}$ contains vertices or
nodes. These are the actors in the systems discussed above. We will refer to vertex $v_i$ as
node i. This paper considers unweighted, directed edges. So, the edge set E contains a
pair $(i, j)$ if there is an edge, or relationship, from node i to node j: $i \to j$. The edge set
can be represented by the adjacency matrix $A \in \{0, 1\}^{n \times n}$:

(2.1)    $A_{ij} = \begin{cases} 1 & \text{if } (i,j) \text{ is in the edge set} \\ 0 & \text{otherwise.} \end{cases}$

The graph is undirected if $(i,j) \in E \Rightarrow (j,i) \in E$. The adjacency matrix of such a
graph is symmetric. The graph is directed if there exists a pair of nodes $(i,j)$ such that
$(i,j) \in E$ and $(j,i) \notin E$. In this situation, the adjacency matrix is asymmetric.
The graph Laplacian is a function of the adjacency matrix. It is fundamental to spectral
graph theory and the spectral clustering algorithm [Chung, 1997, von Luxburg, 2007]. For
symmetric adjacency matrix A, define the symmetric graph Laplacian $L^{(s)}$ and diagonal
matrix D, both elements of $\mathbb{R}^{n \times n}$, in the following way,

(2.2)    $D_{ii} = \sum_k A_{ik}, \qquad L^{(s)} = D^{-1/2} A D^{-1/2}.$

Some readers may be more familiar with defining $L^{(s)}$ as $I - D^{-1/2} A D^{-1/2}$. For spectral
clustering, the difference is immaterial because both definitions have the same eigenvectors.
Below is one version of the spectral clustering algorithm.



Spectral clustering according to Shi and Malik [2000]

Input: Symmetric adjacency matrix $A \in \{0,1\}^{n \times n}$, number of clusters k.

(1) Find the eigenvectors $X_1, \dots, X_k \in \mathbb{R}^n$ corresponding to the k largest
eigenvalues of $L^{(s)}$. $L^{(s)}$ is symmetric, so choose these eigenvectors to be
orthonormal. Form the matrix $X = [X_1, \dots, X_k] \in \mathbb{R}^{n \times k}$ by putting the eigenvectors
into the columns.¹

(2) Treating each of the n rows in X as a point in $\mathbb{R}^k$, run k-means with k clusters.
This creates k non-overlapping sets $A_1, \dots, A_k$ with union $\{1, \dots, n\}$.

Output: $A_1, \dots, A_k$. This means that node i is assigned to cluster g if the ith row of X
is assigned to $A_g$ in step 2.

¹ Shi and Malik [2000] do not compute the eigenvector matrix X using $L^{(s)}$. Instead, they compute the
largest generalized eigenvectors to the generalized eigenproblem $(D - A)x = \lambda D x$. These formulations
are mathematically equivalent.
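The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation of Shi and Malik [2000]: it assumes every node has positive degree, and the helper `kmeans` is a plain Lloyd iteration with a deterministic farthest-point initialisation chosen here only to keep the example reproducible. The six-node graph (two triangles joined by one edge) is hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm with farthest-point initialisation (an assumption
    of this sketch); returns an array of cluster labels."""
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def spectral_clustering(A, k):
    """Steps (1) and (2): top-k eigenvectors of D^{-1/2} A D^{-1/2}, then k-means."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)        # assumes every node has degree > 0
    L_s = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_s)  # eigenvalues in ascending order
    X = eigvecs[:, -k:]                     # eigenvectors of the k largest eigenvalues
    return kmeans(X, k)

# Two triangles {0,1,2} and {3,4,5} joined by the single edge 2-3.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

labels = spectral_clustering(A, k=2)
```

On this graph the two triangles are recovered as the two clusters; the numeric label given to each cluster is arbitrary.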



This algorithm requires a symmetric adjacency matrix. As such, the graph must be
undirected. DI-SIM, the co-clustering algorithm proposed in this paper, is similar to the
algorithm above. However, to accommodate an asymmetric matrix, DI-SIM replaces the
eigendecomposition with the singular value decomposition (SVD). SVD is defined as
follows.

Definition 1. Singular value decomposition (SVD) factorizes a matrix $M \in \mathbb{R}^{n \times d}$ ($n \ge d$)
into the product of orthonormal matrices $U \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{d \times d}$ and a diagonal matrix
$\Sigma \in \mathbb{R}^{d \times d}$ with nonnegative entries.

$M = U \Sigma V^T.$

The columns of U contain the left singular vectors. The columns of V contain the right
singular vectors. The diagonal of $\Sigma$ contains the singular values. If the matrix M is square
and symmetric, then SVD is equivalent to eigendecomposition and $U = V$. In this way,
SVD is a generalization of the eigendecomposition.

2.2. The DI-SIM algorithm. Similarly to the undirected spectral clustering algorithm,
DI-SIM utilizes a graph Laplacian. To define a graph Laplacian $L \in \mathbb{R}^{n \times n}$ for directed
graphs, first define the $n \times n$ diagonal matrices P and O.

(2.3)    $P_{ii} = \sum_k A_{ki}, \qquad O_{ii} = \sum_k A_{ik}, \qquad L_{ij} = \frac{A_{ij}}{\sqrt{O_{ii} P_{jj}}}.$

$P_{ii}$ is the number of nodes that send an edge to node i, or the number of parents to node i.
Similarly, $O_{ii}$ is the number of nodes to whom i sends an edge, or the number of offspring
to node i. The DI-SIM algorithm is defined as follows.



DI-SIM

Input: Adjacency matrix $A \in \{0,1\}^{n \times n}$, number of row-clusters $k_r$, number of
column-clusters $k_c$.

(1) Compute the singular value decomposition of $L = U \Sigma V^T$. Discard the
columns of U and V that correspond to the $n - k$ smallest singular values
in $\Sigma$, where $k = \min\{k_r, k_c\}$. Call the resulting matrices $U^k \in \mathbb{R}^{n \times k}$ and
$V^k \in \mathbb{R}^{n \times k}$.

(2) Cluster the columns of L by treating each row of $V^k$ as a point in $\mathbb{R}^k$. Cluster
these points into $k_c$ clusters with k-means. Because each row of $V^k$
corresponds to a node in the graph, the resulting clusters are clusters of the nodes.

(3) Cluster the rows of L by performing step two on the matrix $U^k$ with $k_r$ clusters.

Output: The clusters from step 2 and 3.



This algorithm is a generalization of the undirected spectral clustering algorithm. If A
is the adjacency matrix of an undirected graph, then A is symmetric and $P = O$. This
implies that $L = L^{(s)}$ is a symmetric matrix. Where the undirected spectral clustering
algorithm uses the eigenvectors of $L^{(s)}$, the DI-SIM algorithm above uses the singular
vectors of L. However, when L is symmetric, the eigenvectors of L are equivalent to the
singular vectors of L. In this way, DI-SIM applied to an undirected graph is equivalent to
spectral clustering.
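The three DI-SIM steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: every node is assumed to have at least one parent and one offspring (so that $O^{-1/2}$ and $P^{-1/2}$ exist), the helper `kmeans` is a simple Lloyd iteration with farthest-point initialisation introduced only for this example, and the six-node graph (which contains self-loops for simplicity) is hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm with farthest-point initialisation (an assumption
    of this sketch); returns an array of cluster labels."""
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def di_sim(A, k_r, k_c):
    """Co-cluster the rows (senders) and columns (receivers) of a directed graph."""
    A = np.asarray(A, dtype=float)
    P = A.sum(axis=0)                  # P_ii: number of parents (in-degree)
    O = A.sum(axis=1)                  # O_ii: number of offspring (out-degree)
    L = A / np.sqrt(np.outer(O, P))    # L_ij = A_ij / sqrt(O_ii P_jj); needs O, P > 0
    U, s, Vt = np.linalg.svd(L)        # singular values come sorted, largest first
    k = min(k_r, k_c)
    U_k, V_k = U[:, :k], Vt[:k, :].T   # keep the top-k left and right singular vectors
    row_clusters = kmeans(U_k, k_r)    # step (3): cluster the rows of L
    col_clusters = kmeans(V_k, k_c)    # step (2): cluster the columns of L
    return row_clusters, col_clusters

# Nodes 0,1,2 send only to even-numbered nodes; nodes 3,4,5 send only to odd-numbered nodes.
A = np.zeros((6, 6))
A[:3, [0, 2, 4]] = 1.0
A[3:, [1, 3, 5]] = 1.0

row_clusters, col_clusters = di_sim(A, k_r=2, k_c=2)
```

Here the row clusters separate {0, 1, 2} from {3, 4, 5}, while the column clusters separate the even-numbered nodes from the odd-numbered ones, so each node belongs to two different kinds of cluster.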

SVD is a matrix decomposition technique that, on symmetric matrices, is equivalent to
the eigendecomposition. This decomposition is an essential ingredient in DI-SIM. We are
not the first to apply SVD in a graph setting. The next subsection briefly highlights the
previous algorithms that employ SVD to explore the structure of graphs.

2.2.1. Related SVD methods. DI-SIM provides novel insights into the structure of directed
graphs by co-clustering the nodes with the left and right singular vectors. Several other
researchers have used SVD to explore and understand different features of networks. For
example, Dhillon [2001] used SVD to co-cluster bipartite graphs, Kleinberg [1999] used
SVD in a web search algorithm that was a precursor to Google's PageRank algorithm, Hoff
[2009] used SVD to fit a random effects model to directed network data, and Sussman et al.
[2011] used SVD to cluster the nodes of a directed graph.

Dhillon [2001] suggested an algorithm similar to DI-SIM that was to be applied to
bipartite graphs in which the rows and columns of L correspond to different entities (e.g.
documents and words). As such, L is rectangular. However, the definition of L remains
the same, and the top singular vectors of this matrix play a similar role in both Dhillon
[2001] and DI-SIM. There are four important differences between our algorithms. First, in
DI-SIM, the rows and columns of L index the same set of nodes. So, each node is clustered
into two types of clusters. In Dhillon [2001], the rows and columns index different sets of
objects. As such, the interpretation of the clusters is different. Second, Dhillon clusters
the rows and columns into the same number of clusters. DI-SIM allows for $k_c \neq k_r$. Third,
Dhillon uses far fewer singular vectors. To find k clusters, Dhillon uses $\ell = \lceil \log_2 k \rceil$
singular vectors ($\lceil x \rceil$ is the smallest integer greater than or equal to x). While this makes the
algorithm much faster, the current paper suggests that the singular vectors between $\ell$ and k can
contain additional information. Finally, where DI-SIM runs k-means on $U^k$ and $V^k$ separately,
Dhillon [2001] runs k-means on the rows of the matrix

$\begin{pmatrix} O^{-1/2} U^{\ell} \\ P^{-1/2} V^{\ell} \end{pmatrix},$

where $U^{\ell}$ and $V^{\ell}$ are the left and right singular vectors corresponding to the top $\ell$ singular
values.

Kleinberg [1999] proposed the concept of "hubs and authorities" for hyperlink-induced
topic search (HITS). This algorithm was a precursor to Google's PageRank algorithm
[Page et al., 1999]. To find authoritative sources of information, HITS "propose[d] an
algorithmic formulation of authority, based on the relationship between a set of relevant
authoritative pages and the set of 'hub pages' that join them together in the link structure"
[Kleinberg, 1999]. To perform a web search, the HITS algorithm first initializes with a
set of roughly 100 websites that could have been found with the standard text analysis
algorithms in 1999. Second, it constructs the subgraph composed of all sites that link to
or are linked from those 100 pages. Finally, it takes the top left singular vector and the top
right singular vector of the resulting adjacency matrix. These left and right singular vectors
respectively assign each website a hub-ness and authoritativeness score. HITS uses only
the top singular vectors for web search. DI-SIM uses the top several singular vectors for
clustering.
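The final HITS step described above (taking the top left and right singular vectors of the adjacency matrix) can be illustrated with a short sketch. The five-node link graph is hypothetical, and the sketch covers only the scoring step, not the construction of the subgraph from an initial set of pages.

```python
import numpy as np

# Hypothetical link graph: pages 0 and 1 link out; pages 2, 3, 4 are linked to.
A = np.zeros((5, 5))
for i, j in [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3)]:
    A[i, j] = 1.0   # A[i, j] = 1 means page i links to page j

U, s, Vt = np.linalg.svd(A)

# Singular vectors are defined only up to sign, so take absolute values.
hub = np.abs(U[:, 0])          # top left singular vector: hub-ness scores
authority = np.abs(Vt[0, :])   # top right singular vector: authoritativeness scores
```

Page 0, which links to the most pages, receives the largest hub score, and pages 2 and 3, which are linked from both hubs, receive equal authority scores that exceed the score of page 4.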

Hoff [2009] applied SVD to directed graphs to estimate "sender-specific and
receiver-specific latent nodal attributes." He used the following random effects model,

$P(A \mid Z) = \prod_{i \neq j} \frac{\exp\{(\beta^T X_{ij} + Z_{ij}) A_{ij}\}}{1 + \exp(\beta^T X_{ij} + Z_{ij})},$

where $X_{ij}$ contains observed covariates and $Z_{ij}$ is an "unobserved latent effect." The matrix
Z is modeled as a low rank matrix using the SVD of residuals from the model fit without
Z. In Hoff's words, "this approach allows for the graphical description of a social network
via the latent factors of the nodes, and provides a framework for the prediction of missing
links in network data." The left and right singular vectors estimate the "sender-specific"
and the "receiver-specific traits."



In a separate line of research, Sussman et al. [2011] studied how the singular vectors
of a random adjacency matrix (generated from the standard Stochastic Blockmodel) can
estimate the block memberships. Sussman et al. cluster the nodes of the network; they do
not co-cluster the nodes. Further, their Stochastic Blockmodel does not encode the concept
of co-clustering. They use the left and right singular vectors in a way that effectively
ignores edge direction.

SVD has been used in other statistical methodology as well, such as correspondence
analysis (CA) and canonical correlation analysis. In fact, DI-SIM normalizes the rows and
columns in an identical fashion to CA. CA is a multivariate methodology with a long and
storied existence. CA has similarities to principal components analysis, but it is applicable
to categorical data. The methodology was first published in Hirschfeld [1935] and, like
spectral clustering, it has been rediscovered and reapplied several times over [Guttman,
1959]. This method has become popular in ecology and has been refined in several
different ways [Hill, 1979, Ter Braak, 1986]. It is particularly popular among French ecologists
[Holmes, 2006]. While there are strong algorithmic connections to CA, the language and
techniques of CA have not yet been applied to network data. This is a potentially fruitful
area for further research.



2.3. Interpreting the singular vectors. This subsection examines how the singular
vectors of L use the similarity measures "number of common parents" and "number of
common offspring." Recall that SVD expresses a matrix $M \in \mathbb{R}^{n \times d}$ as the product of three
matrices, $M = U \Sigma V^T$. Lemma 2.1 shows how to compute the matrices U, V, and $\Sigma$,
giving insight into the similarity measures used by DI-SIM. The lemma follows from Lemma
7.3.1 in Horn and Johnson [2005].
Lemma 2.1. With SVD, $M = U \Sigma V^T$ for $M \in \mathbb{R}^{n \times d}$. The matrices U and V contain the
eigenvectors to the symmetric, positive semi-definite matrices $M M^T$ and $M^T M$
respectively. Both $M M^T$ and $M^T M$ have the same eigenvalues and these values are contained
in the diagonal of $\Sigma^2$. If M is symmetric, U contains the eigenvectors of M and $U = V$.
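Lemma 2.1 can be checked numerically on a small example. The sketch below uses an arbitrary random matrix (an assumption made only for illustration) and verifies that the columns of U and V are eigenvectors of $M M^T$ and $M^T M$ with eigenvalues given by the squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))   # arbitrary n x d matrix with n >= d

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T

# Columns of U are eigenvectors of M M^T with eigenvalues s**2.
left_ok = np.allclose(M @ M.T @ U, U * s**2)

# Columns of V are eigenvectors of M^T M with the same eigenvalues.
right_ok = np.allclose(M.T @ M @ V, V * s**2)

# The eigenvalues of M^T M, sorted in decreasing order, equal the squared singular values.
eig_ok = np.allclose(np.sort(np.linalg.eigvalsh(M.T @ M))[::-1], s**2)
```

With the thin SVD used here, $M M^T$ has additional zero eigenvalues that do not appear in `s**2`; the nonzero eigenvalues of the two matrices coincide.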



This implies that DI-SIM uses the eigenvectors of two matrices, $L^T L$ and $L L^T$
[Kleinberg, 1999]. To understand these matrices, first look at $A^T A$ and $A A^T$. When A is the
adjacency matrix of a directed graph, $A^T A$ and $A A^T$ are two symmetric similarity
matrices that correspond to "the number of common parents" and "the number of common
offspring."

$(A^T A)_{ab} = \sum_x 1\{x \to a \text{ and } x \to b\}$ : The number of common "parents."

$(A A^T)_{ab} = \sum_x 1\{a \to x \text{ and } b \to x\}$ : The number of common "offspring."

These two similarity matrices are symmetric and easily interpretable. $L L^T$ and $L^T L$
perform a similar task while down-weighting the contribution of high degree nodes.

$(L^T L)_{ab} = \sum_x L_{xa} L_{xb} = \frac{1}{\sqrt{P_{aa} P_{bb}}} \sum_x \frac{1\{x \to a \text{ and } x \to b\}}{O_{xx}}$

$(L L^T)_{ab} = \sum_x L_{ax} L_{bx} = \frac{1}{\sqrt{O_{aa} O_{bb}}} \sum_x \frac{1\{a \to x \text{ and } b \to x\}}{P_{xx}}$
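A small worked example may help make the four similarity matrices concrete. The four-node graph below is hypothetical, and every node is given at least one parent and one offspring so that L is well defined.

```python
import numpy as np

# Hypothetical directed graph: 0 -> 2, 0 -> 3, 1 -> 2, 1 -> 3, 2 -> 0, 3 -> 1.
A = np.zeros((4, 4))
for i, j in [(0, 2), (0, 3), (1, 2), (1, 3), (2, 0), (3, 1)]:
    A[i, j] = 1.0

common_parents = A.T @ A      # entry (a, b): number of nodes x with x -> a and x -> b
common_offspring = A @ A.T    # entry (a, b): number of nodes x with a -> x and b -> x

P = A.sum(axis=0)             # P_ii: number of parents of node i
O = A.sum(axis=1)             # O_ii: number of offspring of node i
L = A / np.sqrt(np.outer(O, P))

weighted_parents = L.T @ L    # degree-normalized "common parents" similarity
weighted_offspring = L @ L.T  # degree-normalized "common offspring" similarity
```

Nodes 2 and 3 share both parents (nodes 0 and 1), so `common_parents[2, 3]` is 2; after normalization, $(L^T L)_{23} = \frac{1}{\sqrt{2 \cdot 2}} \left( \frac{1}{2} + \frac{1}{2} \right) = 0.5$, in agreement with the formula above.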

The next sections explicate the difference between these two similarity measures by
exploring the relationship between DI-SIM and "stochastic equivalence," a concept discussed
in Section 3. Using the concept of "stochastic equivalence," Section 3 introduces a new
Stochastic Co-Blockmodel. This new model will serve as a test bed to examine the two
types of clusters produced by DI-SIM. Section 4 shows that DI-SIM can asymptotically
estimate the block structure in the Stochastic Co-Blockmodel.



3. Stochastic Co-Blockmodel 

In a directed network, the "number of common parents" between two actors a and b
is the number of actors that send a directed relationship to both a and b. The "number
of common offspring" between a and b is the number of actors that receive a directed
relationship from both a and b. To study these similarity measures, this paper relates the
observable measurements to two model based notions of similarity.



3.1. Stochastic equivalence, a model based similarity. The Stochastic Blockmodel
[Holland et al., 1983] in social network analysis is motivated by a model based notion of
similarity called stochastic equivalence. Holland et al. [1983] defines stochastic equivalence
thusly:² "We say two nodes a and b are stochastically equivalent if and only if the
probability of any event about A is unchanged by interchanging nodes a and b." This means
that elements of A indexed by a are exchangeable with elements of A indexed by b. In an
undirected network, two actors are stochastically equivalent if and only if they connect to any
third actor with equal probabilities:

$P(a \leftrightarrow x) = P(b \leftrightarrow x) \quad \forall x,$

where $a \leftrightarrow x$ denotes the event that a and x are connected. In a directed network, the
definition of Holland et al. [1983] implies that two nodes a and b are stochastically equivalent
if and only if both of the following hold:

(3.1)    $P(a \to x) = P(b \to x) \quad \forall x$ and

(3.2)    $P(x \to a) = P(x \to b) \quad \forall x,$

where $a \to x$ denotes the event that a sends an edge to x.

The Stochastic Blockmodel builds off the notion of stochastic equivalence. In the Sto- 
chastic Blockmodel, each node belongs to a cluster, or block, and two nodes in the same 
block are stochastically equivalent. Conversely, if two nodes are stochastically equivalent, 
then they are in the same block. 

To align with the concept of co-clustering, this paper suggests a relaxation of the notion
of stochastic equivalence for directed graphs. Instead of one type of stochastic equivalence
which implies both Equations (3.1) and (3.2), one can relax this notion and allow two
different types of stochastic equivalence. Two nodes a and b are stochastically equivalent
senders if and only if Equation (3.1) holds. Two nodes a and b are stochastically equivalent
receivers if and only if Equation (3.2) holds. These two concepts correspond to a model
based notion of co-clusters and they are simultaneously represented in the new Stochastic
Co-Blockmodel.

3.2. A statistical model of co-clustering in directed graphs. The Stochastic
Blockmodel provides a model for a random network with k well defined blocks, or communities
[Holland et al., 1983]. Each node in the Stochastic Blockmodel is a member of exactly
one block, and any two nodes within the same block connect to any third node with an
equal probability. That is, nodes in the same block are stochastically equivalent. Further,
edges are statistically independent. The following is a definition of the classical Stochastic
Blockmodel.

Definition 2. Define two nonrandom matrices, $Z \in \{0, 1\}^{n \times k}$ and $B \in [0, 1]^{k \times k}$. Each
row of Z has exactly one 1, each column has at least one 1, and B is symmetric. Under
the Stochastic Blockmodel, the adjacency matrix $A \in \{0, 1\}^{n \times n}$ is symmetric and random
such that $E(A) = Z B Z^T$. Additionally, each edge is independent, so the probability
distribution factors

$P(A) = \prod_{i,j} P(A_{ij}).$

If the graph is undirected, then the product is only taken over the set $i < j$.

² Our notation has been substituted for the notation in Holland et al. [1983].

In this definition, $Z_{ia} = 1$ if the ith node is a member of the ath block and $B_{ab}$ is the
probability of a connection from a node in the ath block to a node in the bth block. Define
$z_i$ to be the ith row of Z. Notice that if $z_i = z_j$, then the ith row of $E(A)$ is equal to
the jth row of $E(A)$ and the ith column of $E(A)$ is equal to the jth column of $E(A)$:
$z_i B Z^T = z_j B Z^T$ and $Z B z_i^T = Z B z_j^T$. So, two nodes in the same block are stochastically
equivalent. Under an identifiability-type assumption on B, the converse is also true.
For example, it is sufficient for B to be full rank.

The Stochastic Co-Blockmodel is an extension of the Stochastic Blockmodel. 

Definition 3. Define three nonrandom matrices, $Y \in \{0,1\}^{n \times k_y}$, $Z \in \{0,1\}^{n \times k_z}$ and
$B \in [0, 1]^{k_y \times k_z}$. Each row of Y and each row of Z has exactly one 1 and each column has
at least one 1. Under the Stochastic Co-Blockmodel, the adjacency matrix $A \in \{0, 1\}^{n \times n}$
is random such that $E(A) = Y B Z^T$. Further, each edge is independent, so the probability
distribution factors

$P(A) = \prod_{i,j} P(A_{ij}).$

In the Stochastic Blockmodel, $E(A) = Z B Z^T$. In the Stochastic Co-Blockmodel,
$E(A) = Y B Z^T$. In this definition, Y and Z record two types of block membership
which correspond to the two types of stochastic equivalence (Equations (3.1) and (3.2)).



Proposition 3.1. Under the Stochastic Co-Blockmodel, let $y_i$ be the ith row of Y. If $y_i = y_j$,
then nodes i and j are stochastically equivalent senders, Equation (3.1). Similarly, if
$z_i = z_j$, then nodes i and j are stochastically equivalent receivers, Equation (3.2).

Proof. If $y_i = y_j$, then the ith and jth rows of $E(A)$ are equal. Thus, nodes i and j send
edges to any third node x with equal probability. This implies that nodes i and j are
stochastically equivalent senders, Equation (3.1). Similarly, if $z_i = z_j$, then the ith and jth
columns of $E(A)$ are equal. Thus, nodes i and j receive edges from any third node x with
equal probability. This implies that nodes i and j are stochastically equivalent receivers,
Equation (3.2). □



Proposition 3.1 shows that the Stochastic Co-Blockmodel encodes the two types of 
stochastic equivalence in the matrices Y and Z. These two matrices encode the two types 
of clusters that co-clustering estimates. 
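To make the model concrete, the following sketch draws one adjacency matrix from a Stochastic Co-Blockmodel and exposes the structure used in Proposition 3.1. The function name `sample_co_blockmodel`, the label vectors, and the entries of B are all hypothetical choices for illustration; self-loops are allowed here for simplicity.

```python
import numpy as np

def sample_co_blockmodel(y_labels, z_labels, B, seed=0):
    """Draw A with independent Bernoulli entries and E(A) = Y B Z^T."""
    rng = np.random.default_rng(seed)
    k_y, k_z = B.shape
    n = len(y_labels)
    Y = np.eye(k_y)[y_labels]   # n x k_y sender memberships, one 1 per row
    Z = np.eye(k_z)[z_labels]   # n x k_z receiver memberships, one 1 per row
    EA = Y @ B @ Z.T            # population adjacency matrix E(A)
    A = (rng.random((n, n)) < EA).astype(int)
    return A, EA

y_labels = [0, 0, 1, 1, 0, 1]   # sender blocks, k_y = 2
z_labels = [0, 1, 2, 0, 1, 2]   # receiver blocks, k_z = 3
B = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.4]])

A, EA = sample_co_blockmodel(y_labels, z_labels, B)
```

Nodes 0 and 1 share a sender block, so rows 0 and 1 of `EA` are equal (stochastically equivalent senders), even though the two nodes lie in different receiver blocks. Nodes 0 and 3 share a receiver block, so columns 0 and 3 of `EA` are equal.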

While the Stochastic Co-Blockmodel provides a model based notion of co-clustering, it
can be thought of as a reparameterization of the Stochastic Blockmodel. If one starts with
a Stochastic Co-Blockmodel with $k_y$ and $k_z$ blocks, then one can create a traditional
Stochastic Blockmodel with (at most) $k_y k_z$ blocks. In the traditional Stochastic Blockmodel,
two nodes are in the same block if and only if they are stochastically equivalent senders and
they are stochastically equivalent receivers. So, by adding a block for every combination
of a Y-block and a Z-block, the Stochastic Blockmodel can parameterize the same model.
Parameterizing a Stochastic Co-Blockmodel as a Stochastic Blockmodel creates a
problem in estimating the matrix of probabilities B. Under the classical model, estimating this
matrix requires estimating up to $(k_y k_z)^2$ probabilities. In the Co-Blockmodel, estimating
B requires estimating $k_y k_z$ probabilities. In this sense, every Stochastic Co-Blockmodel
is a Stochastic Blockmodel with fewer parameters and a unique interpretation.
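As a worked instance of this parameter count, take the hypothetical values $k_y = 3$ and $k_z = 4$ (chosen only for illustration):

```latex
\[
  \underbrace{(k_y k_z)^2 = (3 \cdot 4)^2 = 144}_{\text{Stochastic Blockmodel with } k_y k_z \text{ blocks}}
  \qquad \text{versus} \qquad
  \underbrace{k_y k_z = 3 \cdot 4 = 12}_{\text{Stochastic Co-Blockmodel}}
\]
```

The equivalent Stochastic Blockmodel would have up to twelve times as many probabilities to estimate in this case, and the ratio $k_y k_z$ grows with the number of blocks.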

A previous version of the Stochastic Blockmodel for directed networks, proposed in
Wang and Wong [1987], allows the random variable pairs $A_{ij}$ and $A_{ji}$ to be statistically
dependent; this is a fundamental difference between the Stochastic Co-Blockmodel and
the previously defined Stochastic Blockmodels for directed graphs. Definition 3 does not
allow for any stochastic dependence. In this sense, Wang and Wong [1987] is more general
than Definition 3. That said, the goal of the Stochastic Co-Blockmodel is to encode
co-clustering structure. Statistical independence is assumed only for mathematical ease.

This paper uses DI-SIM to estimate the partitions in Y and Z. We study this spectral
algorithm and avoid the maximum likelihood estimator (MLE) for two reasons. First,
DI-SIM is computationally feasible while the MLE is computationally intractable.
Second, various versions of spectral methods have proved useful in a wide range of
applications and appear to give more reasonable answers than techniques which rely on discrete
optimization [Leskovec et al., 2008].



The next section gives conditions under which Dl-SIM can asymptotically estimate the 
block partitions in the matrices Y and Z. This implies that the two notions of stochastic 
equivalence relate to the two sets of singular vectors of L. 



4. Asymptotic performance of DI-SIM under the Stochastic Co-Blockmodel



Theorem 4.1 bounds the number of nodes that DI-SIM "misclusters" asymptotically.
This demonstrates that the co-clusters from DI-SIM estimate the two types of block
membership, one in matrix $Y$ and the other in matrix $Z$, corresponding to the two types of
stochastic equivalence.

In a diverse set of large empirical networks, the optimal clusters, as judged by a wide
variety of graph cut objective functions, are not very large [Leskovec et al., 2008]. To
account for this, the results below limit the growth of community sizes by allowing the
number of communities to grow with the number of nodes. Previously, Rohe et al. [2011]
and Choi et al. [2012] have studied this high dimensional setting for the undirected
Stochastic Blockmodel.

Rigorous discussions of clustering require careful attention to identifiability. In the
Stochastic Co-Blockmodel, the order of the columns of $Y$ and $Z$ is unidentifiable. This
leads to difficulty in defining "misclustered." Rohe et al. [2011] give a reasonable and
tractable definition of misclustered that this paper extends to co-clustering. Sections C.1
and C.2 (in the Appendix) motivate and present the retrofitted definition.



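4.1. Main result. Define $\mathscr{A} = E(A)$ as the population version of the adjacency matrix
$A$. Under the Stochastic Co-Blockmodel,

$$\mathscr{A} = Y B Z^T.$$

Define population versions of $O$, $P$, and $L$, all in $\mathbb{R}^{n \times n}$, as

(4.1) $$\mathscr{O}_{ii} = \sum_k \mathscr{A}_{ik}, \qquad \mathscr{P}_{jj} = \sum_k \mathscr{A}_{kj}, \qquad \mathscr{L} = \mathscr{O}^{-1/2} \mathscr{A} \mathscr{P}^{-1/2},$$

where $\mathscr{O}$ and $\mathscr{P}$ are diagonal matrices. The following definitions provide for an alternative
definition of $\mathscr{L}$. Define

$$D_y = \mathrm{diag}(B Z^T 1_n) \in \mathbb{R}^{k_y \times k_y}, \qquad D_z = \mathrm{diag}(1_n^T Y B) \in \mathbb{R}^{k_z \times k_z},$$

where $\mathrm{diag}(x)$ for $x \in \mathbb{R}^d$ is a diagonal matrix in $\mathbb{R}^{d \times d}$ with $\mathrm{diag}(x)_{jj} = x_j$. The
following is an alternative expression for $\mathscr{L}$:

$$\mathscr{L} = Y B_L Z^T, \qquad B_L = D_y^{-1/2} B D_z^{-1/2}.$$

Define $P_{\max}$ to be the population of the largest block in $Y$,

(4.2) $$P_{\max} = \max_{j = 1, \dots, k_y} (Y^T Y)_{jj}.$$

Define $B_{L, \cdot j}$ as the $j$th column of $B_L$, and define

(4.3) $$\gamma_z = \sqrt{P_{\min}} \, \min_{i \ne j} \| B_{L, \cdot i} - B_{L, \cdot j} \|_2,$$

where $P_{\min}$ is the population of the smallest block in $Y$. The next theorem bounds the size
of the set of $Y$-misclustered nodes, $|\mathscr{M}_y|$, and the size of the set of $Z$-misclustered nodes,
$|\mathscr{M}_z|$. The definitions of $\mathscr{M}_y$ and $\mathscr{M}_z$ are presented in the Appendix, in Sections C.1
and C.2.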

Theorem 4.1. Suppose $A \in \mathbb{R}^{n \times n}$ is an adjacency matrix sampled from the Stochastic
Co-Blockmodel with $k_y$ left blocks and $k_z$ right blocks. Assume $k_y \le k_z$. Define $\mathscr{L}$ as in
(4.1). Define $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_{k_y} > 0$ as the $k_y$ nonzero singular values of $\mathscr{L}$. Define
$\mathscr{M}_y$ and $\mathscr{M}_z$ as the sets of $Y$- and $Z$-misclustered nodes, as in (C.11) and (C.15). Define

$$\tau_n = \min_{i = 1, \dots, n} \left( \min\{ \mathscr{O}_{ii}, \mathscr{P}_{ii} \} \right) / n.$$

Define $P_{\max}$ as in (4.2) and $\gamma_z$ as in (4.3). Assume there exists $N$ such that for all $n > N$,
$\tau_n^2 > 2 / \log n$. If $n^{-1/2} (\log n)^2 = O(\sigma_{k_y}^2)$, then



(4.4) 



(4.5) 



(logn)^ 



A proof of Theorem 4.1 is contained in Appendix C.

The bound on $|\mathscr{M}_z|$ includes $\gamma_z$. The next proposition makes an additional assumption
to simplify this quantity.

Proposition 4.1. Under the Stochastic Co-Blockmodel, if all expected in- and out-degrees
are equal,

$$\mathscr{O}_{ii} = \mathscr{P}_{jj} = n \tau_n \quad \text{for all } i \text{ and } j,$$

then

$$\gamma_z = \frac{\sqrt{P_{\min}}}{n \tau_n} \min_{i \ne j} \| B_{\cdot i} - B_{\cdot j} \|_2,$$

where $B_{\cdot i}$ is the $i$th column of $B$.

Under the assumptions of Theorem 4.1 and Proposition 4.1,



(4.6) 



n(log 



n 



\B.,-B. 




One might be concerned with the $n$ in the numerator of the right hand side of (4.6). The end
of this section discusses the proportion of misclustered nodes, $|\mathscr{M}_y|/n$ and $|\mathscr{M}_z|/n$. After
dividing by $n$, the denominator can drive the bound to zero.

The bound for $|\mathscr{M}_z|$ appears to exceed the bound for $|\mathscr{M}_y|$. In fact, if $k_y = k_z$, then
$|\mathscr{M}_z|$ can be bounded as in (4.4) with $P_{\max}$ replaced with an analogous quantity for $Z$.
(If $k_y = k_z$, then the definition of $\mathscr{M}_z$ in the Appendix should be changed to a definition
analogous to $\mathscr{M}_y$. The current definition of $\mathscr{M}_z$ in Equation (C.15) is formulated to allow
$k_z > \mathrm{rank}(\mathscr{L})$.)

There is only a problem when $k_y < k_z$. $\mathrm{Rank}(\mathscr{L})$ is at most $k_y$. So, the singular value
decomposition represents the data in $k_y$ dimensions and the k-means steps for both the left
and the right clusters are done in $k_y$ dimensions. In estimating $Y$, there is one dimension in
the singular vector representation for each of the $k_y$ blocks. At the same time, the singular
value representation shoehorns the $k_z$ blocks in $Z$ into fewer than $k_z$ dimensions. So, there
is less space to separate each of the $k_z$ clusters, obscuring the estimation of $Z$.
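
The two-sided procedure described above (one singular value decomposition, then k-means on the left and on the right singular vectors) can be sketched as follows. This is a minimal illustration and not the paper's implementation: it assumes the normalization $L = O^{-1/2} A P^{-1/2}$ with $O$ and $P$ the diagonal matrices of row and column sums, the function names are hypothetical, and the k-means step uses a simple farthest-first initialization chosen here for determinism. The formal algorithm is defined in Section 2.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-first initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def di_sim_sketch(A, k_y, k_z):
    """Assumed form: L = O^{-1/2} A P^{-1/2}, SVD, then k-means on each side."""
    out_deg = A.sum(axis=1)                      # O_ii, row sums
    in_deg = A.sum(axis=0)                       # P_jj, column sums
    L = A / np.sqrt(np.outer(out_deg, in_deg))   # requires nonzero degrees
    U, _, Vt = np.linalg.svd(L)
    d = min(k_y, k_z)                            # both k-means steps use d dimensions
    return kmeans(U[:, :d], k_y), kmeans(Vt[:d].T, k_z)

# Population check on a small model: E(A)_ij = p + r if y_i == z_j, else r.
k, s, p, r = 3, 4, 0.3, 0.1
y = np.repeat(np.arange(k), s)
z = np.tile(np.arange(k), s)
A_pop = r + p * (y[:, None] == z[None, :])
y_hat, z_hat = di_sim_sketch(A_pop, k, k)
```

Applied to the population matrix, the rows of the singular vector matrices are constant within blocks, so the sketch recovers both partitions up to a relabeling of the blocks. On a sampled graph the result depends on the k-means initialization.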

The two main assumptions of Theorem 4.1 are

(1) $n^{-1/2} (\log n)^2 = O(\sigma_{k_y}^2)$,

(2) eventually, $\tau_n^2 \log n > 2$.

The first assumption, a spectral gap condition, requires that the smallest nonzero singular
value of $\mathscr{L}$ is large enough. The second assumption ensures that the expected degree
of each node grows sufficiently fast. If $\tau_n$ is constant, then the expected degree of each
node grows linearly. The assumption $\tau_n^2 > 2 / \log n$ is almost as restrictive. These two
conditions imply the sufficient conditions needed to show that the top $k_y$ singular vectors
of $L$ converge to the top $k_y$ singular vectors of $\mathscr{L}$. That theorem is contained in the
Appendix.

To understand the bound in Theorem 4.1, define the following toy model.

Definition 4. The four parameter Stochastic Co-Blockmodel is a Stochastic Co-Blockmodel
parameterized by $k \in \mathbb{N}$, $s \in \mathbb{N}$, $r \in (0, 1)$, and $p \in (0, 1)$ such that $p + r < 1$. The
matrices $Y, Z \in \{0, 1\}^{n \times k}$ each contain $s$ ones in each column and $B = p I_k + r 1_k 1_k^T$.

In the four parameter Stochastic Co-Blockmodel there are $k$ left- and right-blocks each
with $s$ nodes and the node partitions in $Y$ and $Z$ are not necessarily related. If $y_i = z_j$,
then $P(i \to j) = p + r$. Otherwise, $P(i \to j) = r$.

Corollary 4.1. Assume the four parameter Stochastic Co-Blockmodel, with $r$ and $p$ fixed
and $k$ growing with $n = ks$. Then,

$$\sigma_k = \frac{1}{k (r/p) + 1},$$

where $\sigma_k$ is the $k$th largest singular value of $\mathscr{L}$. If $k = O(n^{1/4} / \log n)$, then

$$|\mathscr{M}_y| + |\mathscr{M}_z| = o(k^3 \log^2 n).$$

Further, the proportion of nodes that are misclustered converges to zero,

$$(|\mathscr{M}_y| + |\mathscr{M}_z|) / n = o(n^{-1/4}).$$



The proof of Corollary 4.1 is contained in the Appendix. (Corollary 4.1 adopts a definition
of $\mathscr{M}_z$ that is analogous to $\mathscr{M}_y$ because in the four parameter model $k_y = k_z$.)

5. Simulations

The three simulations in this section address three different aspects of the theoretical
results in Theorem 4.1. The first two simulations investigate $\tau$ and $\sigma_k$, respectively. These
simulations investigate DI-SIM's non-asymptotic sensitivity to these quantities under the
four parameter Stochastic Co-Blockmodel. The third simulation compares DI-SIM's
performance to its theoretical guarantees as $k_z$ grows and $k_y$ remains fixed.

k-means is the final step of DI-SIM. k-means often gets trapped in local optima. To avoid
this problem, this section initializes the k-means centers to the rotated population singular
vectors. In preparing these simulations, this initialization was particularly helpful in
the more challenging simulations, where greater than 50 of the nodes are misclustered.

In the following simulations, any simulated graph that had a node with no in-edges or no 
out-edges was discarded. The entire graph was re-simulated until every in- and out-degree 
was nonzero. In this sense, every simulated graph is conditioned to have strictly positive 
in- and out-degrees. 

5.1. Simulations 1 and 2. The first two simulations come from the four parameter model
with 30 blocks ($k = 30 = k_y = k_z$) and 30 nodes in each block ($s = 30$). In the first set of
simulations, the probabilities $p$ and $r$ vary in such a way that $\tau$ decreases and $\sigma_k$ remains
fixed. In the second set of simulations, $p$ and $r$ vary in such a way that $\sigma_k$ increases and $\tau$
remains fixed. Because $k_y = k_z$, the first two simulations only investigate the estimation
of $Z$. Analogous results would apply to $Y$.

Theorem 4.1 says, perhaps surprisingly, that the estimation performance only depends
on $n$, $P_{\max}$, $\tau$, and $\sigma_k$. In the four parameter model, these quantities do not depend on $Y$.
There are three setups corresponding to three different designs for $Y$ and $Z$. In all designs,
$k = 30$ and $s = 30$.

• In the first setup, $Y$ and $Z$ are chosen uniformly at random from the set of all
possible partitions that ensure each block contains $s = 30$ nodes.

• In the second setup, $Y = Z$.

• In the third setup, the blocks are "non-overlapping", meaning that if $y_i = y_j$, then
$z_i \ne z_j$ for all pairs of distinct nodes $i$ and $j$.

These three designs range from Y and Z being the same to Y and Z being completely 
different. Simulations 1 and 2 investigate whether making Y and Z more similar improves 
the estimation of the partition. By examining these three different setups, the first two 
simulations show that the estimation of Z is neither improved nor hampered if Y and Z 
are more similar. 
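
The three designs and the re-simulation rule described earlier in this section can be sketched as follows. This is an illustration under stated assumptions, not the code used for the paper's simulations: the function names are hypothetical, self-loops are allowed, the random design is drawn by permuting balanced label vectors, and the non-overlapping construction shown requires $s \le k$.

```python
import numpy as np

def make_design(k, s, design, rng):
    """Block labels (y, z) for the 'same', 'random', or 'non-overlapping' design."""
    i = np.arange(k * s)
    y = i // s
    if design == "same":
        z = y.copy()
    elif design == "random":
        y = rng.permutation(y)
        z = rng.permutation(y)
    elif design == "non-overlapping":
        # Node number b of Y-block a goes to Z-block (a + b) mod k; needs s <= k.
        z = (i // s + i % s) % k
    else:
        raise ValueError(design)
    return y, z

def sample_graph(y, z, p, r, rng, max_tries=1000):
    """Draw A_ij ~ Bernoulli(p + r) if y_i == z_j, else Bernoulli(r); redraw
    the whole graph until every in- and out-degree is nonzero."""
    prob = r + p * (y[:, None] == z[None, :])
    for _ in range(max_tries):
        A = (rng.random(prob.shape) < prob).astype(int)
        if A.sum(axis=0).min() > 0 and A.sum(axis=1).min() > 0:
            return A
    raise RuntimeError("no graph with strictly positive degrees was drawn")

rng = np.random.default_rng(1)
k, s, p, r = 6, 6, 0.4, 0.1
designs = {d: make_design(k, s, d, rng) for d in ("same", "random", "non-overlapping")}
graphs = {d: sample_graph(y, z, p, r, rng) for d, (y, z) in designs.items()}
```

The example uses $k = s = 6$ only to keep the graphs small; the simulations in this section use $k = s = 30$.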



5.1.1. Simulation 1. One drawback of Theorem 4.1 is that it requires an asymptotically
dense graph. Very few empirical networks display such an edge density. This simulation
investigates the sensitivity of DI-SIM to a diminishing number of edges. To ensure that
these results are not confounded by the effects of $\sigma_k$ (the spectral gap), the values of $p$
and $r$ change such that $\tau$ decreases while $\sigma_k$ stays constant. By letting $r$ vary and defining
$p = 8r$, $\tau$ changes while $\sigma_k$ stays fixed at

$$\sigma_k = \frac{1}{k (r/p) + 1} = \frac{1}{30 (1/8) + 1}.$$

Figure 1 displays the simulation results for a sequence of 100 equally spaced values of $r$
between .02 and .05. To decrease the variability, each simulation was run 10 times. Only
the average is displayed.
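
The two parameter paths used in Simulations 1 and 2 can be checked numerically. The sketch below is illustrative only: the function name is hypothetical, it uses the $Y = Z$ design, and it assumes the population Laplacian is $\mathscr{O}^{-1/2} \mathscr{A} \mathscr{P}^{-1/2}$ and that $\tau_n$ is the minimum expected degree divided by $n$. It computes $\tau_n$ and the $k$th singular value directly from the population matrix.

```python
import numpy as np

def population_quantities(k, s, p, r):
    """tau_n and the k-th singular value of the population Laplacian, computed numerically."""
    n = k * s
    y = np.repeat(np.arange(k), s)
    A_pop = r + p * (y[:, None] == y[None, :])     # uses the Y = Z design
    out_deg, in_deg = A_pop.sum(axis=1), A_pop.sum(axis=0)
    L_pop = A_pop / np.sqrt(np.outer(out_deg, in_deg))
    sigma = np.linalg.svd(L_pop, compute_uv=False)
    tau = min(out_deg.min(), in_deg.min()) / n
    return tau, sigma[k - 1]

k = s = 30
# Path with p = 8r: sigma_k = 1 / (k (r/p) + 1) = 1 / 4.75 for every r.
sim1 = [population_quantities(k, s, 8 * r, r) for r in (0.02, 0.035, 0.05)]
# Path with r = 1/27 - p/30: tau = r + p/k = 1/27 for every p.
sim2 = [population_quantities(k, s, p, 1 / 27 - p / 30) for p in (0.05, 0.2, 0.36)]
```

Along the first path only $\tau_n$ moves, and along the second only $\sigma_k$ moves, which is what lets each simulation isolate one quantity.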



[Figure 1: plot titled "More edges make clustering easier"; horizontal axis: tau.]

Figure 1. This simulation uses the four parameter Stochastic Co-Blockmodel
with $k = 30$ and $s = 30$. The probabilities $p$ and $r$ vary
such that $p = 8r$. This simulation shows that for very small values of $r$,
DI-SIM performs poorly. The three separate lines correspond to the three
different designs for $Y$ and $Z$ described in the bullet points in Section 5.1.
These lines do not appear significantly different, suggesting that the
estimation of $Z$ is neither hampered nor improved if $Y$ is made more similar to $Z$.



Figure 1 demonstrates two things. First, the number of misclustered nodes increases
as $r \to 0$. This shows that DI-SIM performs poorly for small values of $r$. Second, all
three lines are nearly overlapping. This shows that the design of $Y$ and $Z$ does not affect
the performance of the algorithm. The two extremes are that $Y = Z$ or that the blocks are
"non-overlapping," $y_i = y_j \Rightarrow z_i \ne z_j$. The third line represents the random design,
where block memberships are assigned randomly. DI-SIM is unaffected by these design
choices.



5.1.2. Simulation 2. Theorem 4.1 requires that $n^{-1/2} (\log n)^2 = O(\sigma_k^2)$. This condition
implies that the singular vectors that represent the block memberships are well separated
from the singular vectors that do not represent the block memberships. This simulation
investigates the sensitivity of DI-SIM to $\sigma_k$ under the four parameter Stochastic
Co-Blockmodel. To ensure that these simulation results are not confounded by the effects
of $\tau$, the values of $p$ and $r$ change such that $\tau$ remains constant. By letting $p$ change and
defining

$$r = \frac{1}{27} - \frac{p}{30},$$

$\tau$ remains fixed at $1/27 \approx .037$. Figure 2 displays the simulation results for a sequence
of 100 equally spaced values of $p$ between .05 and .36. To decrease the variability, each
simulation was run 10 times. Only the average is displayed.

Figure 2 demonstrates two things. First, the lines are decreasing. This indicates that,
as $\sigma_k$ (the spectral gap) increases, DI-SIM can correctly cluster more nodes. Second, all
three lines are nearly overlapping. This suggests that, under the four parameter Stochastic
Co-Blockmodel, the design of $Y$ and $Z$ does not interact with the spectral gap ($\sigma_k$) to impact
the performance of the algorithm.



5.2. Simulation 3. In the two previous simulations, $k_y = k_z$. This simulation examines
the unbalanced scenario, where $k_y$ is fixed and $k_z$ grows. In this simulation, each block in
$Y$ has an equal number of nodes ($n / k_y$) and each block in $Z$ has an equal number of nodes
($n / k_z$). The size of each $Z$ block is fixed at 20 ($20 = n / k_z$). Because $k_z$ is growing, $n$ is
growing, and the size of each block in $Y$ is also growing.

In the previous simulations, the matrix $B$ is a diagonal matrix plus a constant matrix.
In the unbalanced setting $B$ is rectangular, so the previous model is no longer applicable.
Instead, $B \in [0, 1]^{k_y \times k_z}$ is generated with iid Uniform(.04, .4) random variables,
redrawing columns until the columns are sufficiently separated. Specifically, the following
pseudocode explains how $B$ is simulated. Define $B_i$ to be the $i$th column of $B$.

Sampling $B \in [0, 1]^{k_y \times k_z}$ for Simulation 3

Initialize $B_{uv} \sim$ Uniform(.04, .4) iid for all $u \in 1, \dots, k_y$, $v \in 1, \dots, k_z$.
for ($v$ in $2 : k_z$)
    while $\min_{j = 1, \dots, v-1} \| B_j - B_v \|_2 < .06 k_y$
        $B_{iv} \sim$ Uniform(.04, .4) iid for all $i \in 1, \dots, k_y$.

Figure 2. This simulation uses the four parameter Stochastic Co-Blockmodel
with $k = 30$ and $s = 30$. The probabilities $p$ and $r$ vary
such that $r = 1/27 - p/30$. This simulation shows that after $\sigma_k$ crosses a
threshold, between .02 and .04, DI-SIM performs well. The three separate
lines correspond to the three different designs for $Y$ and $Z$ described in the
bullet points in Section 5.1. These lines are indistinguishable, again suggesting
that the estimation of $Z$ is neither harder nor easier if $Y$ is made
more similar to $Z$.

The constraint that $\| B_j - B_v \|_2 > .06 k_y$ helps prevent $\gamma_z$, defined in Equation (4.3),
from becoming too small. The definition of $\gamma_z$ relies on $B_L$. However, for computational
considerations, the constraint $\| B_j - B_v \|_2 > .06 k_y$ is enforced on $B$ and not $B_L$.
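
The sampling scheme for $B$ can be written as a short rejection loop. The following is a sketch under stated assumptions rather than the authors' code: the function name and the `max_tries` guard are additions made here, and the separation threshold is taken to be $.06 k_y$ in Euclidean norm as in the constraint above.

```python
import numpy as np

def sample_B(k_y, k_z, rng, low=0.04, high=0.4, sep=0.06, max_tries=100_000):
    """Draw B with iid Uniform(low, high) entries, redrawing column v until it is
    more than sep * k_y away (in Euclidean norm) from columns 0, ..., v - 1."""
    B = rng.uniform(low, high, size=(k_y, k_z))
    for v in range(1, k_z):
        for _ in range(max_tries):
            gaps = np.linalg.norm(B[:, :v] - B[:, [v]], axis=0)
            if gaps.min() > sep * k_y:
                break
            B[:, v] = rng.uniform(low, high, size=k_y)
        else:
            raise RuntimeError(f"column {v} could not be separated")
    return B

rng = np.random.default_rng(3)
B = sample_B(k_y=5, k_z=6, rng=rng)
```

Because earlier columns are never changed once accepted, every pair of columns satisfies the constraint when the function returns. The number of redraws per column can grow quickly with $k_z$, which is why the sketch includes a guard.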



Figure 3 suggests that the proportion of misclustered nodes converges to zero faster than
Theorem 4.1 implies. The horizontal axis in this figure corresponds to $k_z$ growing from 5
to 65 in intervals of 5. Each $Z$ block contains 20 nodes. Each $Y$ block has an equal number
of nodes and this size grows with $k_z$. At each point, $B$ is generated 50 times. For each
of these $B$'s, one graph is simulated. The dashed line corresponds to the theoretical rate
of convergence (Equation (4.5) divided by $n$) averaged over the 50 simulations. The solid
line corresponds to the proportion of nodes that DI-SIM $Z$-misclusters averaged over the
50 simulations. The slope of the solid line appears steeper than the dashed line, suggesting
that there is room to improve our theoretical results in Theorem 4.1.

Figure 3 does not display the average number of $Y$-misclustered nodes. For every
simulation in which $k_z > 10$, there were no $Y$-misclustered nodes, $|\mathscr{M}_y| = 0$.

Figure 3. This simulation sets $k_y = 5$ and allows $k_z$ to grow from 5
to 65 in intervals of 5. At each point, the data is generated 50 times. The
dashed line represents the theoretical rate of convergence for the proportion
of misclustered nodes (this is Equation (4.5) divided by $n$). The solid line
gives the proportion of nodes $Z$-misclustered. Notice that the solid line
appears to have a steeper slope than the dashed line. This suggests there is
room for improvement in our theoretical results.



6. Discussion 

By extending both spectral clustering and the Stochastic Blockmodel to a co-clustering 
framework, this paper aims to better conceptualize clustering in directed graphs. Our hope 
is to present co-clustering as a meaningful procedure for directed networks and that this 
helps to guide the development of reasonable questions and sensible similarity measures 
for network researchers. 

In particular, Section 2 introduces the DI-SIM algorithm. By using the singular value
decomposition, this algorithm uses the similarity measures "number of common parents" and
"number of common offspring" to co-cluster the nodes into two different partitions.
Section 3 motivates and introduces the Stochastic Co-Blockmodel that encodes the concepts
of co-clustering in a statistical model. The classical Stochastic Blockmodel employs the
fundamental concept of stochastic equivalence; two nodes in the same block are
stochastically equivalent. Section 3 argues that one can allow for two distinct concepts of
stochastic equivalence in directed graphs,

Stochastically equivalent senders: $P(a \to x) = P(b \to x) \;\; \forall x$
Stochastically equivalent receivers: $P(x \to a) = P(x \to b) \;\; \forall x$.

This prompts a new parameterization of the Stochastic Blockmodel for directed graphs. 
This new model contains two partitions of the nodes. The first partition corresponds to the 
first form of stochastic equivalence above. The second partition corresponds to the second 
form of stochastic equivalence. In the classical Stochastic Blockmodel, where there is one 
type of stochastic equivalence, the estimation of the blocks is equivalent to clustering. In 
this new model, with two types of stochastic equivalence, the estimation of the two types 
of blocks is equivalent to co-clustering. 



Theorem 4.1 in Section 4, this paper's main result, shows that the clusters from DI-SIM
estimate the two different partitions in the Stochastic Co-Blockmodel, even in the
high-dimensional setting where the number of blocks increases with the number of nodes. In
other words, under certain conditions, the two sets of clusters from DI-SIM estimate the
two different types of stochastic equivalence encoded in the Stochastic Co-Blockmodel.
Asymptotically in the number of nodes, Theorem 4.1 bounds the number of "misclustered"
nodes as long as (1) the spectral gap is not too small and (2) the minimum expected degree
grows fast enough. These results are analogous to the results on spectral clustering under
the Stochastic Blockmodel [Rohe et al., 2011]. The simulations in Section 5 demonstrate
that breaking these conditions can drastically diminish the ability of DI-SIM to estimate
the blocks in the Stochastic Co-Blockmodel.



The main limitation of Theorem 4.1 is that it does not apply to sparse graphs. Rather,
this theorem requires that the minimum expected degree grows at the same rate as the
number of nodes (ignoring $\log n$ terms). In large empirical networks, edges are not dense
enough to suggest this type of asymptotic framework. One area for future research is to
study the statistical properties of spectral clustering in a sparse graph asymptotic
framework. Previous clustering results for the Stochastic Blockmodel suggest that it is possible
to cluster the nodes when the graph is sparse; Bickel and Chen [2009] studied the MLE
and modularity methods under the low dimensional Stochastic Blockmodel, Choi et al.
[2012] studied the MLE under a high dimensional Stochastic Blockmodel where the number
of clusters grows like $n^{1/2}$, and Zhao et al. [2011] studied the MLE and modularity
methods under the low dimensional, degree-corrected Stochastic Blockmodel. In all of
these papers, the expected degree can grow like log'^^" n. Unfortunately, the clusters esti- 
mated by maximum likelihood and modularity techniques are computationally intensive, 
and potentially NP-hard, to compute. Future research should investigate the performance
of spectral techniques in an asymptotic setting that allows for sparse connections.

Appendix A. Directed latent space model

The following definition of the directed latent space model is motivated by the
Aldous-Hoover representation for infinite exchangeable arrays and the latent space model
proposed by Hoff et al. [2002]. It specifies the distribution of the random directed adjacency
matrix $A \in \{0, 1\}^{n \times n}$.

Definition 5. The random adjacency matrix A is from the directed latent space model if 

and only if 

i<j 

where $\{z_i, y_i\}_{i=1}^n \subset \mathbb{R}^k \times \mathbb{R}^k$ are pairs of random vectors that are independent across
$i = 1, \dots, n$.

In this definition, $P(A_{ij} \mid y_i, z_j)$ is the probability mass function of $A_{ij}$ conditioned on
$y_i$ and $z_j$. Define $Y \in \mathbb{R}^{n \times k}$ such that its $i$th row is $y_i$ for all $i \in V$. Similarly, define
$Z \in \mathbb{R}^{n \times k}$ such that its $i$th row is $z_i$. Throughout this paper we condition on $Y$ and $Z$.

Because $P(A_{ij} = 1 \mid Y, Z) = E(A_{ij} \mid Y, Z)$, the model is then completely parametrized by
the matrix

$$\mathscr{A} = E(A \mid Y, Z) \in \mathbb{R}^{n \times n},$$

where $\mathscr{A}$ depends on $Y$ and $Z$, but this is dropped for notational convenience.

The Stochastic Blockmodel, introduced by Holland et al. [1983], is a specific latent
space model with well defined communities. The following definition extends the
Stochastic Blockmodel to allow for the asymmetric communities discussed in the previous
section.

Definition 6. The Stochastic Co-Blockmodel with $k$ blocks is a directed latent space
model with

$$\mathscr{A} = Y B Z^T,$$

where $Y, Z \in \{0, 1\}^{n \times k}$ both have exactly one 1 in each row and at least one 1 in each
column and $B \in [0, 1]^{k \times k}$ is full rank.

Appendix B. Convergence of Singular Vectors

The classical spectral clustering algorithm above can be divided into two steps: (1) find
the eigendecomposition of the graph Laplacian and (2) run k-means. Rohe et al. [2011]
studied the estimation performance of the classical spectral clustering algorithm under a
standard social network model. In this analysis, standard perturbation results were not
enough to control the eigenvectors of the random graph Laplacian. Instead, the paper devised
the "squaring trick." This paper also utilizes the squaring trick.

This trick is the composition of two observations. The first observation is easily
demonstrated on the adjacency matrix of a directed Erdos-Renyi random graph. Under this
model, each element of $A \in \{0, 1\}^{n \times n}$ is a Bernoulli random variable with probability
$p$. As a result, $A_{ij}$ is either a zero or a one, independent of the size of the matrix $n$.
However, the elements of $AA$ are equal to the sum of $n$ zero-one variables,

$$[AA]_{ij} = \sum_k A_{ik} A_{kj}.$$

By applying concentration results to this sum of (almost) independent random variables,
standard perturbation theorems imply that the eigenvectors of $AA$ are close to the
eigenvectors of $E(A) E(A)$. The second observation in the squaring trick is the following:

$$Ax = \lambda x \;\Rightarrow\; AAx = A(\lambda x) = \lambda^2 x.$$

This shows that any eigenvector of $A$ is also an eigenvector of $AA$. Taken together, these
observations show that the eigenvectors of $A$ are close to the eigenvectors of $E(A)$. These
results are easiest to state for the matrix $A$ under the Erdos-Renyi random graph model.
However, analogous results hold for the graph Laplacian under the more general latent
space model [Hoff et al., 2002, Rohe et al., 2011].
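
Both observations behind the squaring trick can be illustrated numerically. The sketch below is only an illustration with arbitrarily chosen values of `n` and `p`; the symmetrized matrix `S` is introduced here purely so that a real eigendecomposition is available, and is not part of the paper's argument.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 0.3
A = (rng.random((n, n)) < p).astype(float)   # directed Erdos-Renyi adjacency matrix

# First observation: each entry of AA is a sum of n zero-one terms,
# so AA / n concentrates around p^2 even though entries of A do not concentrate.
AA = A @ A
max_dev = np.abs(AA / n - p ** 2).max()

# Second observation: an eigenvector of a matrix is also an eigenvector of its square,
# with the eigenvalue squared.
S = (A + A.T) / 2                            # symmetrized only so that eigh applies
lam, X = np.linalg.eigh(S)
x, top = X[:, -1], lam[-1]
resid = np.linalg.norm(S @ S @ x - top ** 2 * x)
```

Here `max_dev` is small relative to one, while every entry of `A` itself is at distance at least `min(p, 1 - p)` from its mean.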

To study the singular vectors of $L$, this appendix first studies the convergence of $L^T L$
under the directed latent space model. The results in this paper are asymptotic in the
number of nodes $n$. When it is appropriate, the matrices above are given a superscript of
$n$ to emphasize this dependence. Other times, this superscript is discarded for notational
convenience.

Recall that

(B.1) $$\tau_n = \min_{i = 1, \dots, n} \min\{ \mathscr{O}^{(n)}_{ii}, \mathscr{P}^{(n)}_{ii} \} / n.$$

Because $\mathscr{O}^{(n)}_{ii}$ is the expected out-degree for node $i$ and $\mathscr{P}^{(n)}_{ii}$ is the expected in-degree for
node $i$, $\tau_n$ is the minimum expected degree divided by the maximum possible degree. It
measures how quickly the number of edges accumulates.

Theorem B.1. Define the sequence of random matrices $A^{(n)} \in \{0, 1\}^{n \times n}$ to be from a
sequence of directed latent space models with population matrices $\mathscr{A}^{(n)} \in [0, 1]^{n \times n}$. With
$A^{(n)}$, define the observed graph Laplacian $L^{(n)}$ as in (2.3). Let $\mathscr{L}^{(n)}$ be the population
version of $L^{(n)}$ as defined in Equation (C.1). Define $\tau_n$ as in Equation (B.1).
If there exists $N > 0$ such that $\tau_n^2 \log(n) > 2$ for all $n > N$, then

$$\| (L^{(n)})^T L^{(n)} - (\mathscr{L}^{(n)})^T \mathscr{L}^{(n)} \|_F = o\left( \frac{\log(n)}{\tau_n^2 \, n^{1/2}} \right) \quad \text{a.s.}$$

The proof of Theorem B.1 relies on the following lemma.

Lemma B.1. Under the directed latent space model, if $n^{1/2} / \log(n) > 2$, then

$$P\left( \| L^T L - \mathscr{L}^T \mathscr{L} \|_F > \frac{32 \log(n)}{\tau^2 \, n^{1/2}} \right) \le 6 n^{2 - 2 \tau^2 \log n}.$$

The same statement holds for $\| L L^T - \mathscr{L} \mathscr{L}^T \|_F$.



Proof. The main complication of proving Lemma B.l is controlling the dependencies be- 
tween the elements of l7 L. We do this with an intermediate step that uses the matrix 

and two sets T and A. T constrains the matrices P and O, while A constrains the matrix 
Aff^^A. These sets will be defined in the proof. To ease the notation, define 

PrA(5) = P(fi n r n A) 

where B is some event. 

This proof shows that under the sets T and A the probability of the norm exceeding 
32 log(r7,) n^^/^ is exactly zero for large enough n and that the probability of F or A 
converges to one. To ease notation, define a = 32 log(n) r"^ n"^/^. 

L^L - ^^^\\f >a)< PrA {^[L^L - > j + P ((F n A)^) 

Decompose this first quantity with the union bound. 

(B.2) < J2 {[L^L - > /2n^^ 

+ PrA {[L^L - ^^^\% > a^/2n^) 
To constrain the terms [L'^L — =Sf^^]? , define 



A = Pi < I Y^{AkiAkj - £^ki-^kj)/^kk\ < n ^/^log 

i,j I k=l 



n 



Under A, 



\L^L — = -\ S^iAkiAkj — s^ki'^kj)! ^kk\ < ^ < 7^ 

This controls the second set of terms in line IB. 2[ The next lines show that 

PrA {[L'^L - > aV2n2) = 0. 
For $b(n) = \log(n) \, n^{-1/2}$, define the following set.



r = fl { G [1 - b{n), 1 + b{n)] & Qi e [1 - b{n), 1 + bin)] 



Define another set. 



[1 - 16b{n), 1 + 16b{n)] 



Lemma B.2. Under the definitions and assumptions for the current proof, T C V. 

A proof of this lemma follows the current proof. 
Under the set F, and thus F', 



\UL-L^L\ 



< 

< 

< 

< 
< 



1 1 



15: 

k 

SI 

k 

El 

k 

E 

k 

16 b{n) 



16 logn 

^2^3/2 

a 

2^' 



ik-^kj 



ik-^kj 



Ckk{OuO,,y/^ J^^kkiffu^nY^' 



16 b{n) 



, -^k{^n^ny/' 
16 6(n) 



Using line B.2, this shows that 



0. 



The remaining step is to bound

$$P((\Gamma \cap \Lambda)^c) \le P(\Gamma^c) + P(\Lambda^c).$$

Using the union bound and the Hoeffding bound,



P(n < J2^iOu^^^^[l-b{n),l + b{n)]) + FiCu^^^^[l-bin),l + bin)]) 

i 

< 4nexp(-2r^(logn)^) 

= 47^1 ~2-r2 log n 



Again, using the union bound and the Hoeffding bound. 



kk 



> n loe 



n 



< 



< 



— 2(logn) 



n Y. 



k ^1 ^kk 



^2exp(-2(logn)V) 



< 2n2exp (-2(logn)V^) 

< 2n2"2^''°s". 



Putting the pieces together. 



< + 4n^-2r2 logn ^ 2n2-2^' 



After proving Lemma B.2 



the proof of Lemma B. 1 



is complete. 



□

Proof. This proves Lemma B.2. To simplify the notation, define $u(n) = 1 + b(n)$ and
$l(n) = 1 - b(n)$. Define the sets

Notice that T C {r(l) U 1(2)} C r(3). It is sufficient to show r(3) C V. This is true 
because 

1 _ 1 _ b{n)-^-l _ {b{ny^ - l){b{n)-^ + 1) 

«(n)2 ~ (l + 6(n))2 ~ + l)^ ^ + 1)2 ~ {b{n)-^ + 1)2 

. »M:;^ = l-_^>l-16Mn). 

6(n)-i + l 6(n)-i + l ^ ^ 

The 16 in the last bound is larger than it needs to be so that the upper and lower bounds in 
r' are symmetric. For the other direction, 

1 1 6(n)-2 / 1 



1 + 



l{nj^ ~ (1 -6(r2))2 ~ (6(n)-i - 1)2 ~ V K^)-^ - 1 

2 1 
^^6(n)-i-l ^ (6(n)-i-l)2- 

To bound the last two elements, recall that it is assumed y/n/ log(n) > 2. Equivalently, 
1 — b{n) > 1/2 . This yields both of the following: 

{b{n)-^ - 1)2 ^ 6(n)-i - 1 6(n)-i - 1 ~ 1 - b{n) ^ 

Putting these together, 

-^<l + 16 6(n). 

This shows that T cT'. □ 

The following proves Theorem B.1.

Proof. Adding the $n$ super- and subscripts to Lemma B.1, it states that if $n^{1/2} / \log(n) > 2$,
then

$$P\left( \| (L^{(n)})^T L^{(n)} - (\mathscr{L}^{(n)})^T \mathscr{L}^{(n)} \|_F > \frac{32 \log(n)}{\tau_n^2 \, n^{1/2}} \right) \le 6 n^{2 - 2 \tau_n^2 \log n}.$$

By assumption, for all $n > N$, $\tau_n^2 \log(n) > 2$. This implies that $2 - 2 \tau_n^2 \log(n) < -2$ for
all $n > N$. Rearranging and summing over $n$, for any fixed $\epsilon > 0$,



n=l 



32r-2 log(n)n-V2/e 



n=Af4 
oo 



2-2T2log(n) 



n 



n=N+l 

which is a summable sequence. By the Borel-Cantelli Theorem,

$$\| (L^{(n)})^T L^{(n)} - (\mathscr{L}^{(n)})^T \mathscr{L}^{(n)} \|_F = o(\tau_n^{-2} \log(n) \, n^{-1/2}) \quad \text{a.s.}$$

□



In order to prove Theorem B.2, we need the following version of the Davis-Kahan
Theorem.

Proposition B.1. (Davis-Kahan) Let $S \subset \mathbb{R}$ be an interval. Denote $X$ as an orthonormal
matrix whose column space is equal to the eigenspace of $A = A^T$ corresponding to the
eigenvalues of $A$ contained in $S$ (more formally, the column space of $X$ is the image of the
spectral projection of $A$ induced by $S$). Denote by $\tilde{X}$ the analogous quantity for $\tilde{A} = \tilde{A}^T$.
Define the distance between $S$ and the spectrum of $A$ outside of $S$ as

$$\delta = \min\{ |\ell - s| ;\ \ell \text{ eigenvalue of } A,\ \ell \notin S,\ s \in S \}.$$

If $X$ and $\tilde{X}$ are of the same dimension, then

$$\frac{1}{2} \| X - \tilde{X} M \|_F^2 \le \frac{\| A - \tilde{A} \|_F^2}{\delta^2},$$

where $M$ is an orthonormal matrix. With singular value decomposition, define
$\tilde{X}^T X = U \Sigma V^T$. Then, $M = U V^T$.
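
The rotation $M = U V^T$ in the proposition is the solution of the orthogonal Procrustes problem: it is the orthonormal matrix that best aligns one basis with the other. The sketch below is an illustration with a hypothetical function name and randomly generated inputs. It builds two orthonormal bases of the same column space and checks that the rotation removes the apparent difference between them.

```python
import numpy as np

def rotation(X, X_tilde):
    """M = U V^T from the SVD X_tilde^T X = U Sigma V^T; this orthonormal M
    minimizes ||X - X_tilde M||_F (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X_tilde.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.standard_normal((50, 3)))   # orthonormal columns
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # an arbitrary 3 x 3 orthonormal matrix
X_tilde = X @ R                                      # same column space, different basis

M = rotation(X, X_tilde)
gap_without = np.linalg.norm(X - X_tilde)
gap_with = np.linalg.norm(X - X_tilde @ M)
```

This is why the bound is stated up to $M$: singular vectors are only determined up to such a change of basis, so a direct comparison of $X$ and $\tilde{X}$ would overstate their difference.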

The original Davis-Kahan Theorem bounds the "canonical angle," also known as the
"principal angle," between the column spaces of $X$ and $\tilde{X}$. The appendix in Rohe et al.
[2011] explains how the original theorem can be converted into Proposition B.1.

Proposition B.1 aids the following proof of Theorem B.2.



Proof. To show that A;„ = eventually, define (t„i > • • • > o"„„ to be the singular values 
of L*^"^ and define a*^^ > ■ ■ ■ > to be the singular values of =Sf 



(B.3) 



log(n) 



1/2 



a.s.

This follows from Weyl's inequality [Bhatia, 2007], the fact that the Frobenius norm is an
upper bound of the spectral norm, and Theorem B.1. By assumption,



log(n) (logn 



i2 



Tin 



1/2 ^ 2nV2 = 0{YmYi{5n,5'^}). 



This shows that 

max \ani - o-nJ = o(min{5„, 5'^}). 

i 

By the definition of 5„ and S'^, it follows that eventually, for i = 1, . . . , n. 

Therefore, kn = eventually. 

When kn = the results follow from Proposition B.l To define the matrix use 
singular value decomposition to define (1^)^14 = MniT.nM^2- Then, = MniM^2- 

□ 

The next theorem uses the above lemma in concert with the Davis-Kahan Theorem to
show that the left and right singular vectors of $L^{(n)}$ converge to the left and right singular
vectors of $\mathscr{L}^{(n)}$.

Define the open interval $S_n \subset \mathbb{R}$ to contain the squared singular values of $\mathscr{L}^{(n)}$ that are
of interest. Define

(B.4) $$\delta_n = \min\{ |\sigma^2 - s| ;\ \sigma \text{ is a singular value of } \mathscr{L}^{(n)},\ \sigma^2 \notin S_n,\ s \in S_n \}$$

(B.5) $$\delta'_n = \inf\{ |\sigma^2 - s| ;\ \sigma \text{ is a singular value of } \mathscr{L}^{(n)},\ \sigma^2 \in S_n,\ s \notin S_n \}$$

The quantity $\delta_n$ measures the distance between $S_n$ and the singular values that are not of
interest. If $\delta_n$ is too small, then $L^{(n)}$ might have too many singular values inside $S_n$. The
quantity $\delta'_n$ measures how well $S_n$ insulates the singular values of interest. If $\delta'_n$ is too
small, then some important singular values in $L^{(n)}$ might fall outside of $S_n$. By restricting
the rate at which $\delta_n$ and $\delta'_n$ converge to zero, the next theorem ensures that the "eigengap"
is not too small.

Theorem B.2. Define $A^{(n)} \in \{0, 1\}^{n \times n}$ to be a sequence of growing random adjacency matrices from the directed latent space model with population matrices $\mathscr{A}^{(n)}$. Define the observed graph Laplacian $L^{(n)}$ as in (2.3). Let $\mathscr{L}^{(n)}$ be the population version of $L^{(n)}$ as defined in Equation (C.1).

Define $S_n \subset \mathbb{R}$ as a sequence of intervals. Let $k_n$ denote the number of squared singular values of $L^{(n)}$ that fall within $S_n$. Let $K_n$ denote the number of squared singular values of $\mathscr{L}^{(n)}$ that fall within $S_n$.

Let $V_n \in \mathbb{R}^{n \times k_n}$ be an orthonormal matrix where each column is a right singular vector of $L^{(n)}$ whose corresponding squared singular value falls within $S_n$. Let $\mathscr{V}_n \in \mathbb{R}^{n \times K_n}$ be an orthonormal matrix where each column is a right singular vector of $\mathscr{L}^{(n)}$ whose corresponding squared singular value falls within $S_n$.
Assume there exists $N$ such that for all $n > N$, $\tau_n^2 > 2/\log n$. Define $\delta_n$ and $\delta_n'$ as defined in Equations (B.4) and (B.5).

If $n^{-1/2} (\log n)^2 = O(\min\{\delta_n, \delta_n'\})$, then eventually $k_n = K_n$. After that point,

$\| V_n - \mathscr{V}_n \mathscr{R}_n \|_F = o\left( \frac{\log n}{\delta_n \tau_n^2 n^{1/2}} \right)$ a.s.,

where $\mathscr{R}_n$ is an orthonormal rotation that depends on $V_n$ and $\mathscr{V}_n$.



Appendix C. Clustering 

To rigorously discuss the asymptotic estimation properties of DI-SIM, the next subsections (1) define a population version of the graph Laplacian, (2) examine the behavior of DI-SIM applied to a population version of the graph Laplacian, and (3) compare this to DI-SIM applied to the observed graph Laplacian.

To define "misclustered," the next two subsections study DI-SIM applied to $\mathscr{L}$ and compare these results to DI-SIM applied to $L$. This discussion largely follows the discussion in Rohe et al. [2011].



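C.1. The population version of DI-SIM. Define $\mathscr{A} = E(A)$ as the population version of the adjacency matrix $A$. Recall that under the Stochastic Co-Blockmodel,

$\mathscr{A} = Y B Z^T$,

where $Y \in \{0, 1\}^{n \times k_y}$, $Z \in \{0, 1\}^{n \times k_z}$, and $B \in [0, 1]^{k_y \times k_z}$. All of Section C will assume that $k_y \le k_z$, without loss of generality.

Define population versions of $O$, $P$, and $L$ all in $\mathbb{R}^{n \times n}$ as

(C.1) $\mathscr{O}_{ii} = \sum_k \mathscr{A}_{ik}$

$\mathscr{P}_{jj} = \sum_k \mathscr{A}_{kj}$

$\mathscr{L} = \mathscr{O}^{-1/2} \mathscr{A} \mathscr{P}^{-1/2}$

where $\mathscr{O}$ and $\mathscr{P}$ are diagonal matrices.

This subsection shows that DI-SIM applied to $\mathscr{L}$ can perfectly identify the blocks in the Stochastic Co-Blockmodel.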

To determine the clusters based on common parents, recall that DI-SIM applied to $L$,

(1) finds the right singular vectors $U \in \mathbb{R}^{n \times k_y}$,

(2) defines the $n$ rows of $U$ as $u_1, \dots, u_n \in \mathbb{R}^{k_y}$,

(3) runs k-means on $u_1, \dots, u_n$ with $k_y$ clusters,

(4) repeats steps (1)-(3) for the left singular vectors $V \in \mathbb{R}^{n \times k_y}$ with $k_z$ clusters.

This statement of the algorithm uses the assumption that $k_y \le k_z$.
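
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `graph_laplacian` uses the unregularized form $O^{-1/2} A P^{-1/2}$ and assumes every row and column sum is positive, `lloyd_kmeans` is a simple heuristic rather than the true optimum of the k-means objective, and all function and variable names are hypothetical. For a matrix of the form $Y B Z^T$, the first factor returned by `np.linalg.svd` carries the structure of $Y$ and the second carries the structure of $Z$; how these map onto the labels "left" and "right" depends on the orientation convention for $A$.

```python
import numpy as np

def graph_laplacian(A):
    """Assumed unregularized form L = O^{-1/2} A P^{-1/2} (row sums O, column sums P)."""
    row_sums = A.sum(axis=1)
    col_sums = A.sum(axis=0)
    return A / np.sqrt(np.outer(row_sums, col_sums))

def lloyd_kmeans(points, k, n_iter=100):
    """Heuristic k-means: farthest-point start, then Lloyd iterations."""
    centers = [points[0]]
    for _ in range(1, k):
        sq_dist = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(sq_dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        updated = centers.copy()
        for g in range(k):
            if np.any(labels == g):
                updated[g] = points[labels == g].mean(axis=0)
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

def di_sim(A, k_y, k_z):
    """SVD of the Laplacian, then k-means on the rows of each set of singular vectors."""
    first, _, second_t = np.linalg.svd(graph_laplacian(A))
    u_rows = first[:, :k_y]        # n points in R^{k_y}, clustered into k_y groups
    v_rows = second_t[:k_y].T      # n points in R^{k_y}, clustered into k_z groups
    return lloyd_kmeans(u_rows, k_y), lloyd_kmeans(v_rows, k_z)

# Hypothetical population example: A = Y B Z^T with k_y = 2 and k_z = 3.
y_blocks = np.array([0] * 6 + [1] * 6)
z_blocks = np.array([0, 1, 2] * 4)
B = np.array([[0.9, 0.5, 0.1],
              [0.1, 0.5, 0.9]])
A_pop = np.eye(2)[y_blocks] @ B @ np.eye(3)[z_blocks].T
y_hat, z_hat = di_sim(A_pop, k_y=2, k_z=3)
```

Applied to the population matrix, the two label vectors reproduce the block partitions up to a relabeling of the clusters. On a sampled adjacency matrix the labels would only approximate them.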
k-means clusters points $u_1, \dots, u_n$ in Euclidean space by optimizing the following objective function [Steinhaus, 1956],

(C.2) $\min_{\{m_1, \dots, m_{k_y}\} \subset \mathbb{R}^{k_y}} \sum_i \min_g \| u_i - m_g \|_2^2$.

Define the centroids as the arguments $m_1^*, \dots, m_{k_y}^*$ that optimize (C.2). The analysis in this paper addresses the true optimum of (C.2).

To examine the results of DI-SIM applied to $\mathscr{L}$, define

$D_y = \mathrm{diag}(B Z^T 1_n) \in \mathbb{R}^{k_y \times k_y}$

$D_z = \mathrm{diag}(1_n^T Y B) \in \mathbb{R}^{k_z \times k_z}$

$B_L = D_y^{-1/2} B D_z^{-1/2} \in [0, 1]^{k_y \times k_z}$,

where $\mathrm{diag}(x)$ for $x \in \mathbb{R}^d$ is a diagonal matrix in $\mathbb{R}^{d \times d}$ with $\mathrm{diag}(x)_{ii} = x_i$. It follows that there is an alternative expression for $\mathscr{L}$,

$\mathscr{L} = Y B_L Z^T$.

By the singular value decomposition, there exist orthonormal matrices $\mathscr{X}, \mathscr{Y} \in \mathbb{R}^{n \times k_y}$ and diagonal matrix $\Lambda \in \mathbb{R}^{k_y \times k_y}$ such that

(C.3) $\mathscr{L} = \mathscr{X} \Lambda \mathscr{Y}^T$.
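
The alternative expression $\mathscr{L} = Y B_L Z^T$ and the rank-$k_y$ decomposition (C.3) can be checked numerically. The block assignments and the matrix `B` below are hypothetical values chosen only for illustration, and the variable names are not from the paper.

```python
import numpy as np

# Hypothetical block memberships (k_y = 2, k_z = 3) and block probabilities.
y_blocks = np.array([0, 0, 0, 1, 1, 1, 1])
z_blocks = np.array([0, 1, 2, 0, 1, 2, 2])
B = np.array([[0.8, 0.3, 0.1],
              [0.2, 0.4, 0.9]])

Y = np.eye(2)[y_blocks]            # n x k_y partition matrix
Z = np.eye(3)[z_blocks]            # n x k_z partition matrix
ones = np.ones(Y.shape[0])

# Population Laplacian from (C.1): row sums and column sums of Y B Z^T.
A_pop = Y @ B @ Z.T
L_pop = A_pop / np.sqrt(np.outer(A_pop.sum(axis=1), A_pop.sum(axis=0)))

# Alternative expression through D_y, D_z, and B_L.
d_y = B @ Z.T @ ones               # diagonal of D_y
d_z = ones @ Y @ B                 # diagonal of D_z
B_L = B / np.sqrt(np.outer(d_y, d_z))
L_alt = Y @ B_L @ Z.T

# Rank-k_y singular value decomposition as in (C.3).
first, sing, second_t = np.linalg.svd(L_pop)
L_rebuilt = first[:, :2] @ np.diag(sing[:2]) @ second_t[:2]
```

Both `L_alt` and `L_rebuilt` agree with `L_pop` up to rounding, which reflects that the population Laplacian has only $k_y$ nonzero singular values when $B$ has rank $k_y$.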

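The next lemma shows that DI-SIM applied to the population Laplacian, $\mathscr{L}$, can discover the block structure in the matrices $Y$ and $Z$. Lemma C.1 is essential to defining "misclustered."

Lemma C.1. Under the Stochastic Co-Blockmodel with $k_y \le k_z$, define $\mathscr{L}$ as in Equation (C.1) and define $\mathscr{X}$, $\mathscr{Y}$, and $\Lambda$ as in Equation (C.3).

(1) If $\mathrm{rank}(B) = k_y$, then

(C.4) $\mathscr{X}_i = \mathscr{X}_j \iff y_i = y_j$,

where $y_i$ is the $i$th row of $Y$ and $\mathscr{X}_i$ is the $i$th row of $\mathscr{X}$.

(2) If $\mathrm{rank}(B) = k_y$ and the columns of $B_L$ are distinct, then

(C.5) $\mathscr{Y}_i = \mathscr{Y}_j \iff z_i = z_j$,

where $z_i$ is the $i$th row of $Z$ and $\mathscr{Y}_i$ is the $i$th row of $\mathscr{Y}$.

This lemma implies that there exist matrices $\mu^y \in \mathbb{R}^{k_y \times k_y}$ and $\mu^z \in \mathbb{R}^{k_z \times k_y}$ such that $Y \mu^y = \mathscr{X}$ and $Z \mu^z = \mathscr{Y}$.
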

Proof. There are two results in Lemma C.1. The first result concerns the left singular vectors and the second result concerns the right singular vectors. To simplify later results, the proof of the first result is vastly different than the proof of the second result.

To prove that $\mathscr{X}_i = \mathscr{X}_j \iff y_i = y_j$, construct a matrix $\mu^y$ such that $\mathscr{X} = Y \mu^y$.

Recall that $\mathscr{L} = Y B_L Z^T$. Define $B_{LL} = B_L Z^T Z (B_L)^T$. Then,

$\mathscr{L} \mathscr{L}^T = Y B_{LL} Y^T$.

Because $B_{LL}$ is symmetric, so is $(Y^T Y)^{1/2} B_{LL} (Y^T Y)^{1/2}$. Further, by the assumptions of the Stochastic Co-Blockmodel and the assumption that $\mathrm{rank}(B) = k_y$, $(Y^T Y)^{1/2} B_{LL} (Y^T Y)^{1/2}$ is full rank. By eigendecomposition, there exists an orthonormal matrix $U_{k_y} \in \mathbb{R}^{k_y \times k_y}$ that contains the eigenvectors in its columns and a diagonal matrix $\Lambda' \in \mathbb{R}^{k_y \times k_y}$ (with nonzero entries down the diagonal) such that

$(Y^T Y)^{1/2} B_{LL} (Y^T Y)^{1/2} = U_{k_y} \Lambda' U_{k_y}^T$.

Left multiply by $Y (Y^T Y)^{-1/2}$ and right multiply by $(Y^T Y)^{-1/2} Y^T$:

$Y B_{LL} Y^T = (Y \mu^y) \Lambda' (Y \mu^y)^T$,

for $\mu^y = (Y^T Y)^{-1/2} U_{k_y}$. Note that the left hand side of this equation is equal to $\mathscr{L} \mathscr{L}^T$. Note that $(Y \mu^y)^T Y \mu^y$ is equal to the identity. So, right multiplying this equation by $Y \mu^y$ shows that the columns of $Y \mu^y$ are the left singular vectors of $\mathscr{L}$ and the diagonal of $\Lambda'$ contains the squared singular values. Thus $\Lambda^2 = \Lambda'$,

$\mathscr{X} = Y \mu^y$, and $\mathscr{X}_i = y_i \mu^y$.

Because $\det(\mu^y) = \det((Y^T Y)^{-1/2}) \det(U_{k_y}) \ne 0$, $\mu^y \in \mathbb{R}^{k_y \times k_y}$ is full rank. Therefore, $(\mu^y)^{-1}$ exists and

$y_i \mu^y = y_j \mu^y \iff y_i = y_j$.


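The second part of Lemma C.1 says that, if $\mathrm{rank}(B) = k_y$ and the columns of $B_L$ are distinct, then $\mathscr{Y}_i = \mathscr{Y}_j \iff z_i = z_j$. Notice that

$\mathscr{L} = Y B_L Z^T = \mathscr{X} \Lambda \mathscr{Y}^T = Y \mu^y \Lambda \mathscr{Y}^T$.

Left multiply by $\Lambda^{-1} (\mu^y)^{-1} (Y^T Y)^{-1} Y^T$ and take the transpose to get,

$\mathscr{Y} = Z (B_L)^T ((\mu^y)^{-1})^T \Lambda^{-1}$.

Define $\mu^z = (B_L)^T ((\mu^y)^{-1})^T \Lambda^{-1}$. So, $z_i = z_j \Rightarrow \mathscr{Y}_i = \mathscr{Y}_j$. To prove the other direction, let $z_{iu} = 1$ and $z_{jv} = 1$ (this means that node $i$ is in the $u$th block of $Z$ and node $j$ is in the $v$th block of $Z$) and define $B_L^{\cdot j}$ as the $j$th column of $B_L$.

$\| \mathscr{Y}_i - \mathscr{Y}_j \|_2 = \| z_i \mu^z - z_j \mu^z \|_2 = \| (B_L^{\cdot u} - B_L^{\cdot v})^T ((\mu^y)^{-1})^T \Lambda^{-1} \|_2$

$\ge \| B_L^{\cdot u} - B_L^{\cdot v} \|_2 \, \| ((\mu^y)^{-1})^T \|_m \, \| \Lambda^{-1} \|_m$

where $\| M \|_m = \min_{x : \|x\|_2 = 1} \| x M \|_2$. If $M \in \mathbb{R}^{a \times a}$, then $\| M \|_m$ is the $a$th largest singular value of $M$.

$\| ((\mu^y)^{-1})^T \|_m = \| (U_{k_y}^T (Y^T Y)^{1/2})^T \|_m = \| (Y^T Y)^{1/2} U_{k_y} \|_m = \| (Y^T Y)^{1/2} \|_m$

This is equal to the square root of the smallest block population in $Y$. Call this quantity $\sqrt{P^y_{\min}}$. The other important quantity, $\| \Lambda^{-1} \|_m$, is $1/\sigma_1$ where $\sigma_1$ is the largest singular value of $\mathscr{L}$. This is equal to 1. To see this, notice that $\mathscr{L} \mathscr{L}^T = \mathscr{O}^{-1/2} \mathscr{A} \mathscr{P}^{-1} \mathscr{A}^T \mathscr{O}^{-1/2}$ and the row sums of $\mathscr{A} \mathscr{P}^{-1} \mathscr{A}^T$ are contained down the diagonal of $\mathscr{O}$. So, $\mathscr{L} \mathscr{L}^T$ can be thought of as a classical symmetric graph Laplacian (from symmetric graphs). The largest eigenvalue of a symmetric graph Laplacian is 1 [von Luxburg, 2007]. This implies that $\sigma_1 = 1$. Putting the pieces together,

(C.6) $\| \mathscr{Y}_i - \mathscr{Y}_j \|_2 \ge \sqrt{P^y_{\min}} \, \| B_L^{\cdot u} - B_L^{\cdot v} \|_2$.

So, if the columns of $B_L$ are unique and $u \ne v$, then $\| \mathscr{Y}_i - \mathscr{Y}_j \|_2 > 0$. The contrapositive implies that $\mathscr{Y}_i = \mathscr{Y}_j \Rightarrow z_i = z_j$. □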



Equivalence statements (C.4) and (C.5) imply that, under the Stochastic Co-Blockmodel with certain conditions, there are $k_y$ unique rows in the right singular vectors, $Y \mu^y$, of $\mathscr{L}$ and $k_z$ unique rows in the left singular vectors, $Z \mu^z$, of $\mathscr{L}$. This has important consequences for DI-SIM. DI-SIM applied to $\mathscr{L}$ will run k-means on the rows of $Y \mu^y$. Because there are only $k_y$ unique points, each of these points will be a centroid of one of the resulting clusters. So, if $y_i \mu^y = y_j \mu^y$, then $i$ and $j$ will be assigned to the same cluster.

With equivalence statement (C.4), this implies that DI-SIM applied to the matrix $\mathscr{L}$ can perfectly identify the block memberships in $Y$. Similarly, DI-SIM applied to the matrix $\mathscr{L}$ can perfectly identify the block memberships in $Z$. Obviously, $\mathscr{L}$ is not observed. We need to estimate these memberships from the matrix $L$.
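
The two population facts used above, that the singular vectors of $\mathscr{L}$ have only $k_y$ and $k_z$ distinct rows and that the largest singular value of $\mathscr{L}$ equals 1, can be observed on a small example. The block memberships, the matrix `B`, and the helper `count_distinct_rows` are hypothetical and serve only as a sketch.

```python
import numpy as np

def count_distinct_rows(M, tol=1e-8):
    """Count rows of M that differ by more than tol in Euclidean norm."""
    reps = []
    for row in M:
        if not any(np.linalg.norm(row - rep) < tol for rep in reps):
            reps.append(row)
    return len(reps)

# Hypothetical block structure with k_y = 2 and k_z = 3.
y_blocks = np.array([0, 0, 0, 1, 1, 1, 1])
z_blocks = np.array([0, 1, 2, 0, 1, 2, 2])
B = np.array([[0.8, 0.3, 0.1],
              [0.2, 0.4, 0.9]])
k_y, k_z = B.shape

# Population adjacency Y B Z^T and its Laplacian as in (C.1).
A_pop = np.eye(k_y)[y_blocks] @ B @ np.eye(k_z)[z_blocks].T
L_pop = A_pop / np.sqrt(np.outer(A_pop.sum(axis=1), A_pop.sum(axis=0)))

first, sing, second_t = np.linalg.svd(L_pop)
rows_from_Y = first[:, :k_y]       # equals Y mu^y up to rounding
rows_from_Z = second_t[:k_y].T     # equals Z mu^z up to rounding

n_y = count_distinct_rows(rows_from_Y)
n_z = count_distinct_rows(rows_from_Z)
top_singular_value = sing[0]
```

Here `n_y` and `n_z` match the number of blocks in $Y$ and $Z$, so any clustering that separates distinct points recovers the memberships exactly from the population matrix.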

C.2. Comparing the population and observed clusters. Let $U \in \mathbb{R}^{n \times k_y}$ be a matrix whose orthonormal columns are the right singular vectors corresponding to the largest $k_y$ singular values of $L$. DI-SIM applies k-means (with $k_y$ clusters) to the rows of $U$, the points $u_1, \dots, u_n$. Each row is assigned to one cluster and each cluster has a centroid.

Definition 7. For $i = 1, \dots, n$, define $c_i^u \in \mathbb{R}^{k_y}$ to be the centroid corresponding to $u_i$.

Recall that $y_i \mu^y$ is the centroid corresponding to node $i$ from the population analysis. If the observed centroid $c_i^u$ is closer to the population centroid $y_i \mu^y$ than it is to any other population centroid $y_j \mu^y$ for $y_j \ne y_i$, then it appears that node $i$ is correctly clustered. This definition is appealing because it removes one aspect of the cluster identifiability problem; instead of labeling the clusters $1, \dots, k_y$, the clusters are labeled by points in Euclidean space (the rows of $\mu^y$). Unfortunately, the singular vectors impart one additional source of unidentifiability. They are only identifiable up to a rotation. The orthonormal rotation from Theorem B.2, $\mathscr{R} \in \mathbb{R}^{k_y \times k_y}$, can address this technical nuisance. Consider node $i$ to be correctly clustered if $c_i^u$ is closer to $y_i \mu^y \mathscr{R}$ than it is to any other (rotated) population centroid $y_j \mu^y \mathscr{R}$ for $y_j \ne y_i$. The slight complication with $\mathscr{R}$ stems from the fact that the vectors $c_1^u, \dots, c_n^u$ are constructed from the singular vectors in $U$ and Theorem B.2 in the Appendix shows that, under certain conditions, the singular vectors of $L$ converge to an orthonormal rotation of the singular vectors of $\mathscr{L}$. In other words, the singular vectors $U$ converge to the rotated population singular vectors: $Y \mu^y \mathscr{R}$.

Define $P^y_{\max}$ to be the population of the largest block in $Y$,

(C.7) $P^y_{\max} = \max_{j = 1, \dots, k_y} (Y^T Y)_{jj}$.



The following lemma motivates the definition of misclustered by providing a sufficient 
condition for a node to be correctly clustered. 



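Lemma C.2. For any orthonormal matrix $\mathscr{R}$,

(C.8) $\| c_i^u \mathscr{R}^T - y_i \mu^y \|_2 < \frac{1}{\sqrt{2 P^y_{\max}}} \implies$

(C.9) $\| c_i^u \mathscr{R}^T - y_i \mu^y \|_2 < \| c_i^u \mathscr{R}^T - y_j \mu^y \|_2$ for any $y_j \ne y_i$.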



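Proof. Statement (C.10) is the essential ingredient to prove Lemma C.2.

(C.10) If $y_i \ne y_j$, then $\| y_i \mu^y - y_j \mu^y \|_2 \ge \sqrt{2 / P^y_{\max}}$.

The proof of statement (C.10) requires the following definition,

$\| \mu^y \|_m^2 = \min_{x : \|x\|_2 = 1} \| x \mu^y \|_2^2$.

Notice that

$\| \mu^y \|_m^2 = \min_{x : \|x\|_2 = 1} x \mu^y (\mu^y)^T x^T = \min_{x : \|x\|_2 = 1} x (Y^T Y)^{-1} x^T = \frac{1}{P^y_{\max}}$.

So,

$\| y_i \mu^y - y_j \mu^y \|_2 = \| (y_i - y_j) \mu^y \|_2 \ge \sqrt{2} \, \| \mu^y \|_m = \sqrt{2 / P^y_{\max}}$,

proving statement (C.10). The proof of Lemma C.2 follows,

$\| c_i^u \mathscr{R}^T - y_j \mu^y \|_2 \ge \| y_i \mu^y - y_j \mu^y \|_2 - \| c_i^u \mathscr{R}^T - y_i \mu^y \|_2$

$> \sqrt{\frac{2}{P^y_{\max}}} - \frac{1}{\sqrt{2 P^y_{\max}}} = \frac{1}{\sqrt{2 P^y_{\max}}} > \| c_i^u \mathscr{R}^T - y_i \mu^y \|_2$.

□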



Line (C.9) is the previously motivated definition of correctly clustered. Thus, Lemma C.2 shows that inequality (C.8) is a sufficient condition for node $i$ to be correctly clustered in the $Y$ block.
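
The separation between population centroids that drives Lemma C.2 can be checked numerically. The sketch below assumes hypothetical block sizes, takes the rotation to be the identity, and builds $\mu^y = (Y^T Y)^{-1/2} U$ from an arbitrary orthonormal `U`; none of these values come from the paper.

```python
import numpy as np

block_sizes = np.array([3.0, 5.0, 2.0])      # hypothetical diagonal of Y^T Y
k_y = len(block_sizes)
p_max = block_sizes.max()                     # population of the largest block

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((k_y, k_y)))   # arbitrary orthonormal matrix
mu_y = np.diag(block_sizes ** -0.5) @ U                 # (Y^T Y)^{-1/2} U

# Row a of mu_y is the population centroid y_i mu^y of every node i in block a.
gaps = [np.linalg.norm(mu_y[a] - mu_y[b])
        for a in range(k_y) for b in range(a + 1, k_y)]
min_gap = min(gaps)
bound = np.sqrt(2.0 / p_max)                  # lower bound on the centroid separation

# A point within 1/sqrt(2 P_max) of centroid 0 stays closest to centroid 0.
radius = 1.0 / np.sqrt(2.0 * p_max)
direction = rng.standard_normal(k_y)
c = mu_y[0] + 0.99 * radius * direction / np.linalg.norm(direction)
nearest = int(np.argmin(np.linalg.norm(mu_y - c, axis=1)))
```

The distance between the centroids of blocks $a$ and $b$ works out to $\sqrt{1/n_a + 1/n_b}$ for block sizes $n_a$ and $n_b$, which is never smaller than `bound`, and the perturbed point `c` remains nearest to the centroid it started from.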
Definition 8. For the orthonormal rotation $\mathscr{R}$ from Proposition B.1, define the set of y-misclustered nodes as the nodes that do not satisfy (C.8),

(C.11) $\mathscr{M}_y = \left\{ i : \| c_i^u \mathscr{R}^T - y_i \mu^y \|_2 \ge \frac{1}{\sqrt{2 P^y_{\max}}} \right\}$.



Thus far, this subsection has motivated a definition of y-misclustered. Because $k_y \le k_z$, it is slightly more challenging to give a definition for z-misclustered.

Let $V \in \mathbb{R}^{n \times k_y}$ be a matrix whose orthonormal columns are the left singular vectors corresponding to the largest $k_y$ singular values of $L$.

Definition 9. For $i = 1, \dots, n$, define $c_i^v \in \mathbb{R}^{k_y}$ to be the centroid corresponding to $v_i$.



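To give the analogue to Lemma C.2, define

$P^y_{\min} = \min_{j = 1, \dots, k_y} (Y^T Y)_{jj}$,

define $B_L^{\cdot j}$ as the $j$th column of $B_L$, and define

(C.12) $\gamma_z = \sqrt{P^y_{\min}} \, \min_{i \ne j} \| B_L^{\cdot i} - B_L^{\cdot j} \|_2$.
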

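Lemma C.3. For any orthonormal matrix $\mathscr{R}$,

(C.13) $\| c_i^v \mathscr{R}^T - z_i \mu^z \|_2 < \gamma_z / 2 \implies$

(C.14) $\| c_i^v \mathscr{R}^T - z_i \mu^z \|_2 < \| c_i^v \mathscr{R}^T - z_j \mu^z \|_2$ for any $z_j \ne z_i$.

Proof. If $z_i \ne z_j$, then using equation (C.6),

$\| z_i \mu^z - z_j \mu^z \|_2 \ge \sqrt{P^y_{\min}} \, \min_{u \ne v} \| B_L^{\cdot u} - B_L^{\cdot v} \|_2 = \gamma_z$,

and the proof of Lemma C.3 follows,

$\| c_i^v \mathscr{R}^T - z_j \mu^z \|_2 \ge \| z_i \mu^z - z_j \mu^z \|_2 - \| c_i^v \mathscr{R}^T - z_i \mu^z \|_2$

$> \gamma_z - \gamma_z / 2 = \gamma_z / 2 > \| c_i^v \mathscr{R}^T - z_i \mu^z \|_2$.

□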



Definition 10. For the orthonormal rotation $\mathscr{R}$ from Proposition B.1, define the set of z-misclustered nodes as the nodes that do not satisfy (C.13),

(C.15) $\mathscr{M}_z = \{ i : \| c_i^v \mathscr{R}^T - z_i \mu^z \|_2 \ge \gamma_z / 2 \}$.

Proof. This is a proof of Theorem 4.1. It follows from Lemma C.1 that $\mathscr{L}$ has $k_y$ nonzero singular values. In order to invoke Theorem B.2, define $S_n = [\sigma_{k_y}^2 / 2, \infty)$. Similar to the argument in the proof of Theorem B.2, by Weyl's inequality, Theorem B.1, and the assumptions of Theorem 4.1, the top $k_y$ squared singular values of $L$ will eventually be contained in $S_n$. This definition of $S_n$ implies that $\delta_n = \delta_n' = \sigma_{k_y}^2 / 2$.
Define a partition matrix to be a matrix of all zeros except for a single 1 in each row. Define

$\mathscr{M}(n, k) = \{ Y \Gamma : Y \in \mathbb{R}^{n \times k} \text{ is a partition matrix and } \Gamma \in \mathbb{R}^{k \times k_y} \}$.

Notice that

$\min_{M \in \mathscr{M}(n, k)} \| U - M \|_F^2 = \min_{\{m_1, \dots, m_k\} \subset \mathbb{R}^{k_y}} \sum_i \min_g \| u_i - m_g \|_2^2$,

where the right hand side of the equation is the k-means objective function applied to the rows of $U$. Define

$C_U = \arg\min_{M \in \mathscr{M}(n, k)} \| U - M \|_F^2$.

Notice that the $i$th row of $C_U$ is $c_i^u$. Further, $C_U \mathscr{R}^T = \arg\min_{M \in \mathscr{M}(n, k)} \| U \mathscr{R}^T - M \|_F^2$. Because $Y \mu^y \in \mathscr{M}(n, k)$,

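$\| C_U \mathscr{R}^T - Y \mu^y \|_F \le \| C_U \mathscr{R}^T - U \mathscr{R}^T \|_F + \| U \mathscr{R}^T - Y \mu^y \|_F \le 2 \| U \mathscr{R}^T - Y \mu^y \|_F$.

Therefore,

$|\mathscr{M}_y| \le 2 P^y_{\max} \| C_U \mathscr{R}^T - Y \mu^y \|_F^2$

$\le 8 P^y_{\max} \| U \mathscr{R}^T - Y \mu^y \|_F^2$.

The rate of the last line follows from Theorem B.2.

Similar statements hold for $Z$. Define

$C_V = \arg\min_{M \in \mathscr{M}(n, k)} \| V - M \|_F^2$.

Thus, similar to before,

$\| C_V \mathscr{R}^T - Z \mu^z \|_F \le 2 \| V \mathscr{R}^T - Z \mu^z \|_F$.
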

Therefore, 



(logn)^ 



□ 
Next is a proof of Proposition 4.1.

Proof. To simplify

$\gamma_z = \sqrt{P^y_{\min}} \, \min_{i \ne j} \| B_L^{\cdot i} - B_L^{\cdot j} \|_2$,

first simplify $B_L = D_y^{-1/2} B D_z^{-1/2}$. The $(i, i)$th element of $D_z^{-1/2}$ is

1 1 



Similarly for Dy ^^'^ . So, 



Thus, 



= —B. 

UTn 



_ V mm • IIP n II 



□ 



The following is a proof of Corollary 4.1.

Proof. From the proof of Lemma C.1, $\mathscr{L} \mathscr{L}^T$ has the same nonzero eigenvalues as

$(Z^T Z)^{1/2} B_{LL} (Z^T Z)^{1/2}$

and these values are the squared singular values of $\mathscr{L}$. Under the four parameter Stochastic Co-Blockmodel,

$(Z^T Z)^{1/2} B_{LL} (Z^T Z)^{1/2} = \frac{1}{(kr + p)^2} \left( p^2 I_k + (2p + kr) r \, 1_k 1_k^T \right)$.

The constant vector is an eigenvector of this matrix. It has eigenvalue

$\sigma_1^2 = \frac{p^2 + 2kpr + k^2 r^2}{(kr + p)^2} = 1$.

Any vector orthogonal to a constant vector is also an eigenvector. They all have eigenvalue

$\sigma_k^2 = \frac{p^2}{(kr + p)^2} = \frac{1}{(k(r/p) + 1)^2}$.

Notice that $\tau_n > r$. If $k = O(n^{1/4} / \log n)$, then the conditions of Theorem 4.1 are satisfied. Substituting the values of $P^y_{\max} = s$ and $\sigma_k^2$ above,

$|\mathscr{M}_y| = o(k^3 (\log n)^2)$.

Because $k_y = k_z$, $\mathscr{M}_z$ can be defined as

$\mathscr{M}_z = \left\{ i : \| c_i^v \mathscr{R}^T - z_i \mu^z \|_2 \ge \frac{1}{\sqrt{2 P^z_{\max}}} \right\}$,

where $P^z_{\max} = s$. This definition is analogous to the definition of $\mathscr{M}_y$ (Equation C.11). The result for $\mathscr{M}_y$ above also holds for $\mathscr{M}_z$.

□
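
The two eigenvalues in the proof of the corollary can be confirmed numerically. The sketch assumes that the four parameter model takes $B = p I_k + r 1_k 1_k^T$ with $k$ blocks of $s$ nodes each and $Y = Z$, which is how the displayed matrix reads; the specific values of `k`, `s`, `p`, and `r` are hypothetical.

```python
import numpy as np

k, s = 4, 10          # hypothetical: k blocks with s nodes each
p, r = 0.3, 0.1       # hypothetical within-block boost p and baseline r

B = p * np.eye(k) + r * np.ones((k, k))
Z = np.kron(np.eye(k), np.ones((s, 1)))       # n x k partition matrix, with Y = Z

# Population adjacency and Laplacian O^{-1/2} A P^{-1/2}.
A_pop = Z @ B @ Z.T
L_pop = A_pop / np.sqrt(np.outer(A_pop.sum(axis=1), A_pop.sum(axis=0)))

# The k nonzero squared singular values of the population Laplacian.
sq_sing = np.linalg.svd(L_pop, compute_uv=False)[:k] ** 2
expected_small = 1.0 / (k * (r / p) + 1.0) ** 2
```

The leading squared singular value equals 1 and the remaining $k - 1$ equal `expected_small`, so the gap that enters Theorem 4.1 shrinks as $k(r/p)$ grows.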